(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(19) World Intellectual Property Organization
International Bureau

(43) International Publication Date
29 November 2001 (29.11.2001)

PCT

(10) International Publication Number
WO 01/90346 A2

(21) International Application Number: PCT/US01/16831
(22) International Filing Date: 23 May 2001 (23.05.2001)



(25) Filing Language:

(26) Publication Language:



(71) Applicant (for all designated States except US): CALIFORNIA
INSTITUTE OF TECHNOLOGY [US/US]; 1200 East California Boulevard,
Mail Code 210-85, Pasadena, CA 91125 (US).

(72) Inventors; and
(75) Inventors/Applicants (for US only): VOIGT, Christopher, A. [US/US];
Pasadena, CA 91106 (US). MAYO, Stephen, L. [US/US]; Pasadena, CA 91107
(US). ARNOLD, Frances, H. [US/US]; Pasadena, CA 91105 (US).

(74) Agents: SCHAFFER, Robert et al.; Darby & Darby P.C.,
805 Third Avenue, New York, NY (US).

(SJ) Oe»ig«si«JSJa)«« ('wi/'Wti,yr Ai- AO, Ni .VI xp, 
AZ,8A,BB,B«,H1CBX B/.r.-x.Cllf i,.u,< U CV. 
C/„ Oil, 1>K, DM, O/, lih, hi!, M, CiB, CD. Gh. OH. CiM 
HR, HU, ID, U., IS, Sl\ KK, KG. KP, KJJ, KK, l.C, i.K, 
LS, f.S. LT. LV. L\; MA, MIX MG. MK. MN. MW, MX, 
M'4 NXX N/, PL PT, KO, RU. SD, SE, SG, S!. SK, SL. 
T J, TM. I R. n. TZ. VA.. VXi. VS. VZ, VN, Yl.L ZA, ZW. 

(84> l)cMgnaj«i States f ^^^/wia?;: ARIK) i^rteni (Cjll. OM. 
KI-;, MW, MZ, SD, Si., SZ. IZ. U<f. ZW). Kufasian 
patens (AM. AZ, BY, KG, KZ, MD, Rl/, TJ, T.Vf.!, Eijropi»in 
paiani (AT, 81% OJ, CT, DK, FsS, Pi, PR, Oft. OK, TP,, 
TT, IB. MC, m.., H-, SE.', TR\ OAW. }.iaient (HE B.T, CF, 
CO. (U, C.M. GA, t iN. GVV. \5i . MR. NK. .SN. TD, fCi), 



tmm reoetpi c^tiiat mp<m 






(54) Title: GENE RECOMBINATION AND HYBRID PROTEIN DEVELOPMENT



<.S7) AhstrjKl.t 'rheinveiU(^5nn:i;!li;sliririi-)i^m\fiT!t;t}Kiifsiiird« jx 1 r t it I it [ Ai tiul )iuii ft 



Uk meih<s(i.s of ihci(ivt't!t)on include iBiatviJcat ni..-n d rK<, ishn cr is < v^r (< >. tlKW 



at these locations are less likely to disrupt desirable properties of the protein, such as stability or functionality.
The invention further provides improved methods for directed evolution wherein the polymer is selectively recombined at the
identified "crossover locations". Crossover disruption profiles can be used to identify preferred crossover locations. Structural domains
of a biopolymer can also be identified and analyzed, and domains can be organized into schema. Schema disruption profiles can
be calculated, for example based on conformational energy or interatomic distance, and these can be used to identify preferred or
candidate crossover locations. Computer systems for implementing analytical methods of the invention are also provided.






For two-letter codes and other abbreviations, refer to the "Guidance
Notes on Codes and Abbreviations" appearing at the beginning
of each regular issue of the PCT Gazette.



GENE RECOMBINATION AND HYBRID PROTEIN DEVELOPMENT

This application claims priority under 35 U.S.C. § 119(e) to co-pending U.S.
Provisional Patent Application Serial Nos. 60/207,048 (filed May 23, 2000), 60/235,960
(filed September 27, 2000) and 60/283,567 (filed April 13, 2001).

Numerous references, including patents, patent applications and various publications
are cited and discussed in this specification. The citation and/or discussion of such
references is provided to clarify the description of the invention and is not an admission that
any such reference is "prior art" to the invention described herein. All references cited and
discussed in this specification are incorporated by reference in their entirety and to the same
extent as if each reference was individually incorporated by reference.

1. FIELD OF THE INVENTION

The invention relates to biomolecular engineering and design, including methods for
the design and engineering of biopolymers such as proteins and nucleic acids.

More particularly, the invention relates to improved methods for in vivo and in vitro
directed evolution of biopolymers, such as polypeptides (e.g. proteins) and oligonucleotides
(e.g. DNA and RNA). The invention is particularly suited to techniques which generate
hybrid biopolymers by recombining sequences of biopolymer building blocks, such as
sequences of amino acid residues or nucleic acid residues, from more than one parent
biopolymer (e.g. from two or more parent genes). This can be referred to as "crossing" two
or more parents to produce recombinant offspring. Each location in the offspring where the
biopolymer sequence changes or "crosses over" from one parent to another is called a
"crossover location" or a "cut point." A related term, known in the genetic algorithm
literature, is "schema." In the context of protein engineering, a schema is a representation
or arrangement of polymer building blocks, such as nucleic or amino acid residues, or
recognizable structural domains or energetic conformations, in which each building block
contributes more or less to the structural integrity, form, function, or fitness of the polymer.
In a recombination experiment, parents may have similar or different schema, and the
offspring may preserve or disrupt the schema of one or more parents. In a preferred
embodiment of the invention, schema that are common to two or more parents are preserved
in recombinant offspring.

The invention provides computational methods for predicting beneficial
recombinations of biopolymers, e.g. the fragments, locations or schema of two or more
parent genes which can advantageously be recombined. Directed evolution methods can be
selected and applied to favor identified recombinations. By applying cut points at locations
that preserve schema, the recombinant mutant library has a larger fraction of folded, stable
hybrids or chimeras. Because the stability of the wild type is preserved, it is more likely that
mutants exist in this library that have improvements in the desired properties.

For example, recombinant protocols can be modeled in silico to predict crossover
locations which will tend to preserve and not disrupt advantageous schema. The
computational or in silico techniques of the invention can be used to determine preferred
crossover locations. Residues of one or more biopolymers are identified (e.g. nucleotide
residues of a nucleic acid or amino acid residues of a polypeptide) where crossover
recombination may produce beneficial results, such as one or more improved properties.
Preferably, improvements are obtained while minimally disrupting a desired biopolymer
property, such as stability or functionality. Disruption is less likely when biopolymers are
cut and recombined at structurally tolerant crossover sites determined according to the
invention. Crossover locations or points are identified which tend to have less
or no impact on the stability of the structure, or of the biopolymer
represented, e.g., as schema, according to specified thresholds or parameters. These locations
can be used as candidate crossover locations for recombination experiments. Alternatively,
sets of interacting residues or schema can be identified which are collectively crucial or
important to the structure of the biopolymer, according to a specified threshold or parameter.
Crossovers that disrupt these sets of beneficially interacting residues or schema are not
desirable because they lead to destabilized structures, and thus can be ruled out.
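
The idea of ruling out crossovers that split sets of interacting residues can be sketched as follows. This is only an illustrative sketch, not the method of the invention: the function names, the 0-based position convention, the `max_disrupted` threshold and the example sets are all assumptions introduced here, and a set is treated as "disrupted" simply when a single cut falls between its first and last member.

```python
from typing import Iterable, List, Set


def splits_set(cut: int, residues: Iterable[int]) -> bool:
    """Return True if a cut placed just before position `cut` separates the set.

    Positions are 0-based. A cut at `cut` means residues [0, cut) come from
    one parent and residues [cut, length) come from the other parent.
    """
    positions = sorted(residues)
    return positions[0] < cut <= positions[-1]


def candidate_crossovers(length: int,
                         interacting_sets: List[Set[int]],
                         max_disrupted: int = 0) -> List[int]:
    """List cut positions that split at most `max_disrupted` interacting sets."""
    candidates = []
    for cut in range(1, length):
        disrupted = sum(splits_set(cut, s) for s in interacting_sets)
        if disrupted <= max_disrupted:
            candidates.append(cut)
    return candidates


# Hypothetical 12-residue biopolymer with three sets of interacting residues.
example_sets = [{1, 2, 3}, {6, 8}, {9, 10}]
print(candidate_crossovers(12, example_sets))
```

Raising `max_disrupted` loosens the threshold, which corresponds to tolerating some disruption in exchange for more candidate cut points.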
These techniques provide a targeted approach for obtaining mutant or hybrid
biopolymers with improved properties using directed evolution. For example, the invention
is useful in the design of in vitro recombination experiments where nucleic acid sequences
that encode two or more different parent proteins may be recombined to create hybrid
sequences. Unlike other directed evolution methods, such as family shuffling, that require
high sequence identity or similarity (e.g. 70% or higher), the invention can be applied to
parent proteins of low sequence similarity, e.g. less than 50%, or of no sequence similarity
(0%). For example, cut points for the recombination of proteins are selected based on
preserving three-dimensional or conformational structure or structural motifs. Common
structures or domains can be identified independently of amino acid sequence, or without
requiring overall sequence similarity. Widely different sequences may code for the same or
similar structures or schema. Different proteins with different functions may have similar
structures. Such proteins can be identified and selected as parents for crossover
recombination, at selected cut points which preserve or minimize disruption of common
structures. This improves the likelihood of producing mutants with functions or properties
from more than one parent. For example, a protease of high activity may be recombined, at
selected cut points, with a second structurally similar protein of high thermal stability, to
produce a thermostable protease with high activity. By focusing on structural similarity and
by minimizing structural disruption, the invention provides mutants having new or improved
properties, without needing to rely on serendipitous results from random recombinations of
parents having a high sequence similarity. Recombination based on hybridization or
sequence identity can be called "homologous" recombination. Recombination that is not
based on sequence identity can be called "non-homologous" recombination. The invention
encompasses both methods, which can be used independently, or together.

2. BACKGROUND OF THE INVENTION

The invention is concerned with polymers, primarily biopolymers such as
polynucleotides (chains of nucleic acids, e.g. DNA and RNA) and polypeptides (chains of
amino acids, e.g. proteins and enzymes). More particularly, the invention provides improved
hybrid proteins and methods of obtaining them by crossover recombination.

Proteins are polypeptides that are useful to living organisms. For example, they
provide structures in the body, do physical or chemical work, or act as catalysts for chemical
reactions (i.e. as enzymes). Proteins are made by cells according to genetic information
encoded, transcribed and translated by polynucleotides (DNA and RNA). It is often desirable
to modify proteins so that they have new or improved properties. For example, a protein may
be altered to increase its biological activity (e.g. its potency as an enzyme), or to improve its
stability under different environmental conditions (e.g. temperature), or to change its
function (e.g. to catalyze a different chemical reaction).

Nature makes these kinds of alterations in many ways, including for example genetic
mutations, or changes due to the recombination of genetic material such as occurs from
sexual reproduction. Changes that are beneficial tend to be preserved from generation to
generation, while truly harmful changes may disappear over time, in a process called
evolution. Changes which are neutral, i.e. neither helpful nor harmful, may also be preserved
by default. This is a very long process, and tends to produce random changes which are then
tested for survival by the environment. Scientists looking for proteins with improved
properties have had the very difficult task of searching for changes in proteins at random
from the vast numbers of potential natural sources that are available. Changes that are



loprovfdL iUiU fkisum Mati^rgi-ii-^. s .rnlion^i idinr' towatd 1) ^usot i ueic-i huuhcsc 
techniques also are exceedingly slow, costly, and resource intensive. They are very
inefficient, and may not produce desired results. For example, proteins that act as enzymes
to break down other proteins can be used as stain-removing ingredients of a laundry
detergent, but these proteins may have to work at higher temperatures than in nature.

Identifying proteins with desirable characteristics from nature, such as enzymes with
improved heat resistance (thermal stability) or other fitness characteristics, has been a
haphazard and difficult process. Accordingly, there has been a need for new ways to modify
proteins, or the polynucleotides which encode them, to produce new proteins with improved
properties or fitness. Two separate techniques commonly used to alter the properties of
proteins and other biological molecules are directed evolution and computational design.
The invention brings these techniques together, and in particular provides guided processes
of genetic diversity that reduce the sequence space to be searched, are less prone to random
results, and are more prone to produce proteins with improved fitness. According to the
invention, preferred or optimal cut points for recombination, fragment sizes, and
recombination strategies are provided. Structural information about parent proteins, such as
knowledge of epitopes or active sites, or results of prior mutagenesis experiments, can be
used to improve the outcome of protein evolution experiments. Other factors, such as library
size and landscape data (e.g. structure/function relationships) can also be taken into account.
Principles of statistical mechanics are applied to genetic algorithms, to produce
computational models of evolutionary processes. These models correlate with observations
and experiments in directed evolution, and can be adapted to different experimental designs.
The computational models can also be used to provide a protein design model, which
generates candidate recombinants in silico more rapidly than conventional in vitro methods,
thus allowing experimental parameters to be rapidly tested and optimized.






Directed Evolution

Directed evolution techniques attempt to alter the properties of a biopolymer (e.g.,
a protein or a nucleic acid) by accumulating stepwise improvements through iterations of
random mutagenesis, recombination and screening. See, e.g., Moore & Arnold, Nature
Biotechnology 1996, 14:458; Miyazaki et al., J. Mol. Biol. 2000, 297:1015-1026; Arnold,
Adv. Protein Chem. 2000, 55:ix-xi. Broadly speaking, these methods work by speeding up
the natural processes of evolution. Changes in genetic material (e.g. mutations) are rapidly
v-^d attr A uii! s Jid i^o ■< i ^^uM , ^ c^'N ' -J cm be easily and quickly grown in cell culture 
,e g. oauid'i tl c h x.O I ,f ^ . H30 ^ m itants are rapidly evaluated to identify new or 
improved properties or changes of interest.

In a typical in vitro protein evolution experiment, a naturally occurring or wild-type
protein is identified, and its sequence is altered to produce diversity, for example by mutation
or recombination. This results in large numbers of mutant proteins, which are screened
according to appropriate fitness criteria. For example, the most active mutants that are
reasonably stable may be selected. One or more of these mutants may then be selected as a
parent for another round of evolution. This process may be repeated as desired, for example
until no further improvements in fitness are observed.
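
The mutate, screen, select and repeat cycle described above can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not a laboratory protocol: the `TARGET` string, the `toy_fitness` screen (number of positions matching the target), the library size and the mutation rate are all hypothetical stand-ins for a real assay.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids
TARGET = "MKTAYIAKQR"              # hypothetical optimum used by the toy screen


def toy_fitness(seq):
    """Toy screen: count the positions that match the hypothetical target."""
    return sum(a == b for a, b in zip(seq, TARGET))


def mutate(seq, rate, rng):
    """Replace each residue with a random one with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else aa
                   for aa in seq)


def evolve(parent, fitness, rounds=10, library_size=200, rate=0.05, seed=0):
    """Repeat mutation and screening until no further improvement is seen."""
    rng = random.Random(seed)
    history = [fitness(parent)]
    for _ in range(rounds):
        library = [mutate(parent, rate, rng) for _ in range(library_size)]
        best = max(library, key=fitness)
        if fitness(best) <= fitness(parent):
            break  # no further improvement in fitness observed
        parent = best  # the best mutant becomes the parent of the next round
        history.append(fitness(parent))
    return parent, history


best, history = evolve("A" * len(TARGET), toy_fitness)
print(best, history)
```

Because only one parent is carried forward each round, the sketch mirrors the stepwise accumulation of improvements; carrying several parents forward would be the starting point for a recombination step.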

Genetic recombination methods have been widely applied to accelerate in vitro
protein evolution. Examples include DNA shuffling, random-priming recombination, and
the staggered extension process (StEP). See e.g., Stemmer, Proc. Natl. Acad. Sci., 91:10747
(1994); Stemmer, Nature, 370:389 (1994); Zhao & Arnold, Nucleic Acids Res., 25:1307
(1997); Zhao et al., Nature Biotechnology, 16:258 (1998); Crameri et al., Nature, 391:288
(1998); Volkov et al., Methods Enzymol. 328:447-456 (2000).

Some of the advantages of directed evolution methods are that they can be used with
large polymers, for example proteins with more than 500 amino acids; they produce unique
and unexpected results; and proteins can be evolved to achieve several goals
simultaneously. Some disadvantages are that directed evolution is limited by the genetic
code. For example, there are sixty-four 3-base nucleic acid codons that code for 20 amino
acids. A single mutation in a codon may not be enough for a wild-type amino acid to be
changed into all 19 other possible amino acids. Often, two or more DNA mutations in the
codon are required. In directed evolution experiments, the DNA mutation rate is small and
the gene is large, so the probability of obtaining two neighboring DNA mutations is small.
Practically, this means that not all amino acid mutations are possible using random
mutagenesis alone. Nevertheless the number of hybrids which can be produced is vast, but
even then they cannot be made and screened as readily as would be desired. It is also
difficult to produce simultaneous non-additive arrangements of sequences. A non-additive
effect means that two or more simultaneous mutations have to be made in order to observe
a fitness improvement. Often, the individual mutations lead to a decreased fitness. Because
the mutation rate is small and the gene is large, there is a very small probability of obtaining
the precise multiple-mutant needed to observe a non-additive change, and one that provides
a benefit or fitness improvement.
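
The genetic-code limitation described above can be checked directly. The sketch below assumes the standard genetic code (written as the usual 64-letter string in T, C, A, G order) and counts which amino acids are reachable from a codon by a single base change; the function name is introduced here for illustration only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_mutation_neighbors(codon):
    """Amino acids (other than the original) reachable by one base change."""
    original = CODON_TABLE[codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            aa = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
            if aa != "*" and aa != original:  # skip stop codons and silent changes
                reachable.add(aa)
    return reachable


# Largest number of different amino acids any codon can reach in one step.
print(max(len(single_mutation_neighbors(c)) for c in CODON_TABLE))
print(sorted(single_mutation_neighbors("TGG")))  # neighbors of tryptophan
```

Each codon has only nine single-base neighbors, so no codon can reach all 19 other amino acids in one step, which is the restriction the text describes.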

Computational Design

Computational design, by contrast, has developed separately from directed evolution
and is a fundamentally different approach. See, Street & Mayo, Structure 7:R105 (1999).
Unlike the essentially random approach of directed evolution, computational design attempts
to predict and then make the changes or mutations that will be beneficial or useful. Thus,
the general objective of computational design is to identify particular interactions in a protein
(or other biopolymer) that lead to desirable properties, and then modify the biopolymer
sequence to optimize those interactions. For example, a force-field model can be used to
quantitatively describe interactions between amino acid residues in a protein. An amino acid
sequence may then be computed, at least in theory, to globally optimize these interactions.
See e.g., Malakauskas & Mayo, Nature Structural Biology, 5:470 (1998); Dahiyat & Mayo,
Science, 278:82 (1997).

Some of the advantages of computational protein design are that very large numbers
of sequences can be screened in silico; multiple mutations can be considered
simultaneously; and all possible amino acid substitutions (the entire possible sequence space)



can be searched. Some disadvantages are that computational requirements increase
exponentially with larger polymer sequences; at least some structural information (e.g. a
defined secondary sequence) is needed; and certain unique or unexpected possibilities may
be overlooked because the polymer backbone is held constant for the calculations. In
addition, it takes considerable if not restrictive computing power and computation time to
calculate detailed energies between all possible amino acid combinations.
The Sequence Space 

Computational design can effectively search a large sequence space, that is, a large
number of sequences (e.g., > 10-*^). See, Dahiyat & Mayo, Science 278:82 (1997). However,
the technique is currently limited by the size of the biopolymer. The largest full sequence
design accomplished to date is a 28-mer zinc finger protein (id.). Partial designs can be done
to improve the stability of proteins up to about 70 amino acids. Moreover, the technique
currently is based on calculating the molecule's conformational energy, i.e. the relative
energy of the molecule's folded and unfolded states. Thus, current computational methods
have only been used to improve a molecule's stability. The technique has not been used to
improve other properties of biopolymers, such as activity, selectivity, efficiency, or other
characteristics of biological fitness.

Directed evolution methods, by contrast, have the benefit of improving any property
in a molecule that can be detected and/or captured by a screen, for example catalytic activity
of an enzyme. One effective and widely used directed evolution method involves production
of a library of mutants from a parent sequence, e.g., by using error-prone PCR to produce
random point mutations. Moore & Arnold, Nature Biotechnology, 14:458 (1996); Miyazaki
et al., J. Mol. Biol., 297:1015-1026 (2000). However, the technique is limited by several
factors, one of which is the practical size of the screen. Zhao & Arnold, Curr. Op. St. Biol.
7, 480-485 (1997). Increasing the number of mutants screened enables the user to sample
a larger fraction of possible sequences (a larger sequence space) and therefore provides better
improvements in the properties of interest. However, the most mutants that may be observed
in any practical screen or selection is between about 10'' to 1 0", depending upon the specific
screening method. In comparison, however, an average protein of 300 residues will have at
least 10^390 possible amino acid combinations. Thus, any practical screening or selection
assay can only search a small fraction of the possible sequences.
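
The size of the sequence space quoted above follows from simple arithmetic: a protein of 300 residues built from 20 amino acids has 20^300 possible sequences, which is a little more than 10^390. The sketch below works in base-10 logarithms to avoid very large numbers. The library size of 10^6 used in the example is purely an assumed illustrative value, not a figure taken from the text.

```python
import math


def log10_sequence_space(n_residues, alphabet_size=20):
    """log10 of the number of possible sequences of the given length."""
    return n_residues * math.log10(alphabet_size)


def log10_fraction_sampled(n_screened, n_residues, alphabet_size=20):
    """log10 of the fraction of sequence space covered by a screen."""
    return math.log10(n_screened) - log10_sequence_space(n_residues, alphabet_size)


print(log10_sequence_space(300))             # about 390.3
print(log10_fraction_sampled(10 ** 6, 300))  # assumed library of 10^6 mutants
```

Even if the assumed library size were raised by many orders of magnitude, the sampled fraction would remain vanishingly small, which is why reducing the space to be searched matters.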

Moreover, the probability that any single random mutation will improve a property
of the parent sequence is small, and the probability of improvement decreases rapidly when
multiple simultaneous mutations are made. Furthermore, the negligible probability that two
or three mutations occur in a single codon, and the significant biases of error-prone PCR,
severely restrict the possible amino acid substitutions which may be searched. Again, there
is a need to reduce the sequence space which must be searched in order to obtain desirable
hybrids.

Family Shuffling (Recombination of Divergent Homologous Sequences)

Accumulating point mutations in a single sequence is an effective fine-tuning
mechanism for directed evolution, but other methods can also be used to create molecular
diversity, e.g. polymer sequences from which useful sequences can be identified by screening
or selection. Mutations can be produced in vitro using error-prone PCR methods. Beneficial
mutations can then be combined using genetic recombination methods. For example, a
parent (e.g. wild-type) can be mutated to create a mutant library, which is then screened for
desirable mutants. These mutants can then be used as parent genes in recombination
experiments. The mutant parents are cut into fragments and the fragments are recombined
to provide a library of recombinant mutants. The recombinant mutants can then be screened
for beneficial or improved properties.

Recombination can be done without mutagenesis of a common parent. For example,
two or more different but related parent genes can be recombined in a method known as
"family shuffling" or "DNA shuffling." Related sequences, e.g. from divergent homologous
genes, can be cut and recombined to make hybrid genes. These methods generally rely on
an assumption that the parent genes share closely related structures. See, e.g., Stemmer,
Nature, 370:389 (1994); Volkov et al., Methods Enzymol. 328:447-456 (2000);
Crameri et al., Nature, 391:288 (1998). The shuffling process creates a library of new hybrid
genes which code for proteins with sequence information from any or all parents. For
example, the first half of the sequence might come from one parent, while the second half
might come from another. Another hybrid might have the first 20 nucleotides from one
parent, the next 500 from another parent, and the last nucleotides from a third parent. The
point at which a sequence derived from one parent switches to a sequence derived from
another parent is called a crossover. There may be one or more crossovers in a given
sequence.
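
A hybrid of this kind can be pictured as a list of segments, each copied from one parent, with a crossover wherever consecutive segments come from different parents. The following sketch assumes that the parents are already aligned and of equal length; the segment representation `(parent_index, start, end)` and the function names are illustrative choices made here.

```python
def build_chimera(parents, segments):
    """Assemble a hybrid from (parent_index, start, end) segments.

    `start` and `end` are 0-based, half-open positions in the aligned parents.
    """
    return "".join(parents[p][start:end] for p, start, end in segments)


def crossover_points(segments):
    """Positions at which the hybrid switches from one parent to another."""
    points = []
    for (prev_parent, _, _), (next_parent, start, _) in zip(segments, segments[1:]):
        if prev_parent != next_parent:
            points.append(start)
    return points


# Three hypothetical aligned parents, written with distinct letters for clarity.
parents = ["A" * 10, "B" * 10, "C" * 10]
segments = [(0, 0, 2), (1, 2, 7), (2, 7, 10)]
print(build_chimera(parents, segments))  # first 2 from A, next 5 from B, rest from C
print(crossover_points(segments))
```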

A library of such hybrid genes might contain millions or trillions of different genes
containing different patterns of crossovers. In family shuffling, genes from multiple parents
and even from different species can be recombined, operations that do not occur in nature
but which may nonetheless be useful for rapid adaptation. DNA shuffling is being used to
generate improved proteins, and notably, proteins with features not present in one or all
parent proteins, or not even known to occur in nature. See, Affholter & Arnold,
"Engineering a revolution," Chemistry in Britain, 35:48-51 (1999); Ness et al., "Molecular
Breeding - the natural approach to enzyme design," Advances in Protein Chemistry, 55:261-
292 (2000); Schmidt-Dannert et al., "Molecular breeding of carotenoid biosynthetic
pathways," Nature Biotechnology, 18:750-753 (2000).

DNA shuffling methods rely on hybridization between portions of the parent genes
and can therefore only recombine closely related sequences, usually of more than 70%
sequence identity. Furthermore, these methods generate crossovers between one parent
sequence and another only in regions of the gene where there is high identity between the two
sequences. Stated another way, recombination based on DNA sequence similarity requires
overlap in the DNA between parents for a crossover to occur. The DNA of the parents is
fragmented, and in order for the fragments to reanneal, they need to share some overlap to
allow for DNA hybridization. The StEP protocol does not require as much overlap as the
DNA shuffling protocol originally proposed by Stemmer. A variety of other shuffling
techniques are also known, some of which do not require sequence identity or alignment.
These include for example the ITCHY protocol. Ostermeier et al., Bioorganic & Medicinal
Chem. 7:2130-2144 (1999); Ostermeier et al., Nature Biotechnol. 17:1205-1209 (1999).
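
The dependence of hybridization-based shuffling on sequence identity can be made concrete with a small calculation. The sketch below computes percent identity for two aligned, equal-length sequences and lists the stretches of uninterrupted identity where a hybridization-based crossover could plausibly occur. The minimum stretch length `min_len` and the example sequences are assumptions chosen for illustration; real protocols do not reduce to a single threshold of this kind.

```python
def percent_identity(a, b):
    """Percent of aligned positions at which the two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def identical_windows(a, b, min_len):
    """Half-open (start, end) ranges where a and b agree for >= min_len bases."""
    windows = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            if start is None:
                start = i  # a new identical stretch begins here
        else:
            if start is not None and i - start >= min_len:
                windows.append((start, i))
            start = None
    if start is not None and len(a) - start >= min_len:
        windows.append((start, len(a)))  # stretch running to the end
    return windows


parent_1 = "ATGGCCATTGTA"  # hypothetical aligned parent genes
parent_2 = "ATGGCGATAGTA"
print(round(percent_identity(parent_1, parent_2), 1))
print(identical_windows(parent_1, parent_2, min_len=3))
```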

Many proteins having similar three-dimensional structures show low or even no
discernible sequence identity or similarity. Rational design (Mitra et al., Biochemistry, 32:
12959-12967 (1993); Shimoji et al., Biochemistry, 37:8848-8852 (1998)), computational
approaches (Bogarad & Deem, Proc. Natl. Acad. Sci. USA, 96:2591-95 (1999)) and
combinatorial methods (Ostermeier et al., Nature Biotechnology, 17:1205-1209 (1999)) have
shown that functional proteins can be obtained by recombination of such distantly related or
low sequence similarity parent sequences. Accordingly there is a need for methods that can
provide stable and functional hybrids from recombined parents having low or no sequence
similarity or identity, but having three-dimensional structures in common.

Recombination can be performed using so-called "non-homologous" methods that
do not need sequence identity or overlap, because the experimental protocol relies on other
properties, and does not require DNA hybridization between the parents. Generally, two
parents are recombined with a single crossover point using such methods. If recombination
is restricted to a single crossover point between two parents, the crossover disruption of the
recombinant mutants may be very substantially increased, leading to a library of less-stable
mutants. According to the invention, non-homologous recombination protocols can be
modeled or used together with improved and targeted computational methods to calculate
crossover disruption profiles. These can be applied to favorably restrict crossover locations,
minimize disruption, and select crossover regions and mutants that are more likely to be
stable, and/or exhibit improved fitness.
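
One simple way to picture a crossover disruption profile for a single crossover point is to count, for every possible cut, how many residue-residue contacts would end up with one partner on each side of the cut. The sketch below is an assumption-laden illustration rather than the calculation of the invention: contacts are defined here by a plain distance cutoff between one coordinate per residue, sequence neighbors are ignored, and the coordinates and cutoff are made-up values.

```python
import math


def contacts_from_coords(coords, cutoff):
    """Pairs (i, j), j >= i + 2, whose coordinates lie within `cutoff`."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # skip adjacent residues
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


def disruption_profile(length, contacts):
    """Number of contacts separated by a single cut before each position 1..length-1."""
    return [sum(1 for i, j in contacts if i < cut <= j)
            for cut in range(1, length)]


# Hypothetical six-residue structure made of two compact clusters.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 0, 0), (11, 0, 0), (10, 1, 0)]
contacts = contacts_from_coords(coords, cutoff=2.0)
profile = disruption_profile(len(coords), contacts)
least = min(profile)
print(profile)
print([cut for cut, value in enumerate(profile, start=1) if value == least])
```

In this toy structure the least disruptive cut falls between the two clusters, which is the kind of location such a profile is meant to highlight. An energy-based measure could be substituted for the contact count without changing the overall shape of the calculation.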

Functional Crossover Locations

Random selection of crossover sites, as in conventional family shuffling, does not
favor sites that are more likely to produce functional and improved mutants. Accordingly,
methods of selecting promising crossover sites are needed. It has been empirically observed
that functional shuffled sequences do not contain an even distribution of crossover locations
throughout the sequence. For example, the crossover locations of some in vitro recombinant
mutants are strongly biased towards the N- and C-termini of the resulting functional
pant;ir^ A'„.> >.yr,»> {"-SO^-fiOf, Many of these crossovers 

at the termini do not, however, lead to functional improvements.

Sequence Databases

Given the explosive growth in the gene databases due to the exhaustive sequencing
of large numbers of organisms, the sequences of homologous genes are easily accessible.
However, to date, there is no rigorous method in the art to quantitatively use the information
in sequence databases to identify optimal starting parents for recombination (e.g. shuffling)
experiments. A method to rapidly and quantitatively use such information is desirable. It
is further desirable to have methods that predict where crossover locations in recombination
experiments are likely to generate functional proteins which also may have new and useful
properties. Such methods would be useful for the creation of more diversity in a
recombinant library, with a reduction in the numbers of mutants needed to be produced and
screened. Methods that would address these and/or other problems in the art would allow
the acceleration of in vitro protein evolution and would accelerate the creation of new
proteins (e.g. enzymes) with novel and useful properties. This is of particular interest to
those interested in improved protein-based drugs, and in the use of enzymes in industrial
processes where enzymes must function in non-native environments or must catalyze
non-native chemical reactions.


Thus, there is presently a need in the art for improved methods of designing
biopolymers such as proteins and nucleic acids. Moreover, there exists a need for better
methods for improving one or more properties of a biopolymer. There further exists a need
for improved methods of directed evolution that address one or more
of the above-described problems in the art. For example, there is a need in the art to identify
regions in the sequence of a molecule (e.g., a biopolymer such as a protein or nucleic acid)
where crossover recombination is likely to generate a library of stable mutants or chimeras
that can be screened for one or more beneficial and/or improved properties.






3. SUMMARY OF THE INVENTION

Applicants have discovered that producing mutant biopolymers by crossover
recombination at certain cut points or locations is more likely to preserve stability and/or a
desired property of the polymer, such as functionality, than crossovers in other areas. The
crossover locations are identified by examining at what locations a crossover disrupts a
schema, structural domain or a minimum of coupling interactions between amino acid side
chains of the polymer (e.g. polypeptide). The invention provides novel techniques for
identifying residue locations where crossovers would disrupt a minimum of schema or
coupling interactions in a polypeptide. These methods are straightforward and are
computationally tractable.

Accordingly, a skilled artisan can readily use the methods to identify residues of a
particular polymer sequence that permit crossover recombination with minimal disruption.
The artisan may selectively recombine polymers at the identified crossover locations to
generate recombinant mutants that are likely to be functional, and which can be screened for
properties of interest. Such mutants are more likely to have one or more properties of
interest that are improved over the properties of the parent polymer. Thus, by selectively
recombining parent genes at identified crossover locations, e.g. in silico, a skilled artisan may
more readily and efficiently identify novel sequences with improved properties than if the
artisan used randomized methods or conventional shuffling.
The invention therefore provides methods for selecting residues of a biopolymer
sequence for crossover recombination by obtaining or determining which locations disrupt
a structural domain or a minimal amount of coupling interactions in the amino acid sequence
and selecting the identified crossover locations. The polymers may be any type of polymer
including biopolymers such as, but not limited to, nucleic acids (comprising a sequence of
nucleotide residues) and proteins or polypeptides (comprising a sequence of amino acid
residues).
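
By way of illustration only, the idea of selecting crossover locations that disrupt a minimal
amount of coupling interactions can be sketched as follows. This is a simplified sketch and
not the method of the invention as claimed: it assumes the coupling interactions are already
available as a list of residue index pairs, it counts every pair split by a cut point as
disrupted, and the function names crossover_disruption and least_disruptive_cuts are
hypothetical.

```python
from typing import Iterable, List, Tuple


def crossover_disruption(contacts: Iterable[Tuple[int, int]], length: int) -> List[int]:
    """Count, for each cut point k, the coupled residue pairs split by that cut.

    A cut point k (1 <= k < length) separates residues [0, k) from [k, length).
    A coupled pair (i, j) with i < j is split when i < k <= j.
    Index 0 of the returned profile is unused and stays 0.
    """
    profile = [0] * length
    for i, j in contacts:
        lo, hi = sorted((i, j))
        for k in range(lo + 1, hi + 1):
            profile[k] += 1
    return profile


def least_disruptive_cuts(profile: List[int], n: int) -> List[int]:
    """Return the n cut points with the lowest disruption, in sequence order."""
    candidates = range(1, len(profile))
    best = sorted(candidates, key=lambda k: (profile[k], k))[:n]
    return sorted(best)


# Hypothetical toy input: six residues and three coupled pairs.
toy_contacts = [(0, 3), (1, 2), (4, 5)]
toy_profile = crossover_disruption(toy_contacts, 6)
toy_cuts = least_disruptive_cuts(toy_profile, 2)
```

In this toy input the cut between residues 3 and 4 splits no coupled pair, so it is the
first location such a sketch would select.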

The invention also provides methods for the directed evolution of biopolymers. Two
or more parent sequences are provided, each for example having one or more properties of
interest, and one or more possible crossover locations. One or more recombinant polymers
may then be generated from the parent polymer sequences, in which two or more of the
parents are recombined at one or more selected crossover locations. These mutants are
preferably screened for the one or more properties of interest. Mutants are selected where
one or more properties of interest is modified and preferably is improved. In certain
embodiments, the methods of the invention are iteratively repeated, and selected mutants are
used as parent polymer sequences in subsequent iterations of the method.
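
The iterative scheme just described (recombine at selected crossover locations, screen,
select, and reuse the selected mutants as parents) can be sketched in a few lines. This is a
minimal sketch under stated assumptions: parents are aligned strings of equal length, the
experimental screen is stood in for by a caller-supplied scoring function, and the names
recombine, evolve and toy_score are hypothetical.

```python
import itertools
from typing import Callable, List, Sequence


def recombine(parents: Sequence[str], cuts: Sequence[int]) -> List[str]:
    """Build every hybrid that takes each fragment between cut points from any parent.

    Parents are assumed to be aligned and of equal length. The parents themselves
    are excluded from the returned library.
    """
    bounds = [0, *sorted(cuts), len(parents[0])]
    segments = list(zip(bounds, bounds[1:]))
    hybrids = set()
    for choice in itertools.product(range(len(parents)), repeat=len(segments)):
        hybrids.add("".join(parents[p][a:b] for p, (a, b) in zip(choice, segments)))
    return sorted(hybrids - set(parents))


def evolve(parents: Sequence[str], cuts: Sequence[int],
           score: Callable[[str], float], rounds: int = 3, keep: int = 2) -> List[str]:
    """Repeat recombination and selection, reusing the best sequences as parents."""
    pool = list(parents)
    for _ in range(rounds):
        library = recombine(pool, cuts)
        if not library:
            break  # no new hybrids can be formed from the current pool
        # Rank by score (highest first); ties are broken alphabetically.
        ranked = sorted(set(library) | set(pool), key=lambda s: (-score(s), s))
        pool = ranked[:keep]
    return pool


def toy_score(seq: str) -> int:
    """Hypothetical stand-in for a screen: rewards 'A' in the first half, 'B' in the second."""
    return seq[:2].count("A") + seq[2:].count("B")


toy_library = recombine(["AAAA", "BBBB"], [2])
toy_selected = evolve(["AAAA", "BBBB"], [2], toy_score)
```

In practice the scoring step would be an experimental screen for the property of interest;
the function here exists only so that the loop can be exercised.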

The invention can also be used to identify optimal parent molecules (e.g. preferred
parent genes) for recombination. Similar or structurally related parent molecules can be
evaluated to determine which are more likely, when altered, to produce desirable
improvements. For example, optimal parents can be mined from sequence databases, e.g.
using disruption energy as a measure.

Computer systems are also provided that may be used to implement the analytical
methods of the invention, including methods of identifying crossover locations in a polymer
sequence and/or selecting such residues for mutation (e.g., as part of a directed evolution
method). These computer systems comprise a processor interconnected with a memory that
contains one or more software components. In particular, the one or more software
components include programs that cause the processor to implement steps of the analytical
methods described herein. The software components may further comprise additional
programs and/or files including, for example, sequence or structural databases of polymers.

Computer program products are further provided, which comprise a computer
readable medium, such as one or more floppy disks, compact discs (e.g., CD-ROMs or
RW-CDs), DVDs, data tapes, etc., that have one or more software components encoded
thereon in computer readable form. In particular, the software components may be loaded
into the memory of a computer system and may then cause a processor of the computer
system to execute steps of the analytical methods described herein. The software
components may include additional programs and/or files including databases, e.g., of
polymer sequences and/or structures.






4. BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flow diagram illustrating exemplary recombination embodiments of the
methods of the invention. FIG. 1A illustrates a method for determining a schema disruption
profile. FIG. 1B illustrates a method for modeling an experimental recombinant protocol.

FIG. 2 is a schematic illustration and graphical representation of crossover
disruption.

FIG. 3 is a gene alignment for β-lactamase-like genes: (1) Enterobacter cloacae, (2)
Citrobacter freundii, (3) Yersinia enterocolitica and (4) Klebsiella pneumoniae. SWISSPROT
or TrEMBL accession numbers for the protein sequences and GenBank accession numbers
for the DNA sequences are given.

FIG. 4A is an in silico probability distribution for all crossover locations calculated
from a recombination algorithm for the four β-lactamase sequences of FIG. 3. FIG. 4B is
an in silico probability distribution of crossover locations for β-lactamase when screened for
crossover locations that meet a set threshold. In this example, recombinant mutants are
below the threshold 14. The dark horizontal bars on the x-axis indicate the crossovers
observed in a prior in vitro experiment, Crameri et al., Nature 391:288 (1998). These
curves were calculated using Method 1 of the invention, described below. FIGS. 4C and 4D
are similar to FIGS. 4A and 4B, but were calculated using Method 2 of the invention,
described below.

FIG. 5 is a crossover disruption plot for non-homologous recombination experiments
using the ITCHY protocol, with glycinamide ribonucleotide transformylase. The sequence
range 50-100, where recombinations were restricted in the experiments, is shown on the
x-axis. The crossover disruption is shown on the y-axis.

FIG. 6 shows a probability distribution for schema disruption in computationally
generated recombinant mutants. The probability distribution of the schema disruption is
plotted for the recombinant mutants that contain at least three parents and is normalized by
the total number of mutants. Each curve represents the schema disruption of the
portion of the recombinant mutants that contain each parent sequence: (1) Enterobacter
[illegible] sequence correspond with the least-disruptive schema. The addition of the Yersinia
enterocolitica (3) sequence carries the most [illegible] disruption [illegible] not
observed in the functional hybrid proteins found in DNA shuffling experiments. The inset
bar graph shows the integral between the schema disruption cutoff and zero. This represents
the fraction of low-disruption schema associated with each parent.

FIG. 7 is an example of an in vitro method of overlap extension reassembly,
targeting identified crossover locations. The appropriate fragments may be obtained by
split-pool synthesis.

FIG. 8A shows a fragment reassembly method using a parental template. The
resulting products are subjected to heteroduplex recombination (Volkov et al., Nucl. Acids
Res., 27:18 (1999)) to create libraries of genes within regions of non-identity. More
complexity can be introduced by the addition of more fragments during template assembly.

FIG. 9 shows the preparation of gene fragments prepared by PCR with primers 
directed to regions targeted for crossovers. 

FIG. 10 shows recombination directed to specific sites using crossover primers in 
DNA shuffling. 

FIG. 11 shows an exemplary computer system that may be used to implement
analytical methods of the invention.

FIG. 12 is a flow diagram illustrating one embodiment of a recombinant search
algorithm of the invention, based on sequence identity.

FIG. 13 is a diagrammatic illustration of a computational algorithm used to generate
recombinant mutants by DNA shuffling. (A) First, cut points are distributed randomly
across the gene with a set probability p. In this diagram, the arrows mark cut points and the
thatched lines represent regions of sequence similarity between parents. (B) A parent is
picked at random to determine the first fragment. The next fragment is chosen amongst the
parents that share adequate sequence identity (including the parent of the previous fragment)
with equal probability. (C) The complete library of recombinant mutants that can be
generated by the cut pattern shown.
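
A computational shuffling step of the kind summarized for FIG. 13 might be sketched as
below. This is an illustrative sketch only, not the algorithm of the figure: the cut
probability parameter p_cut, the window-based test for sequence identity around a cut
point, and the function names place_cuts, shares_identity and shuffle_once are all
assumptions introduced for the example. Parents are assumed to be aligned strings of equal
length.

```python
import random
from typing import List, Sequence


def place_cuts(length: int, p_cut: float, rng: random.Random) -> List[int]:
    """Place a cut before each position 1..length-1 independently with probability p_cut."""
    return [k for k in range(1, length) if rng.random() < p_cut]


def shares_identity(a: str, b: str, cut: int, window: int) -> bool:
    """Assumed criterion: a and b are identical over `window` residues on each side of the cut."""
    lo, hi = max(0, cut - window), min(len(a), cut + window)
    return a[lo:hi] == b[lo:hi]


def shuffle_once(parents: Sequence[str], p_cut: float, window: int,
                 rng: random.Random) -> str:
    """Generate one recombinant: random first parent, then an equal-probability choice
    among parents that satisfy the identity criterion at each cut point."""
    cuts = place_cuts(len(parents[0]), p_cut, rng)
    current = rng.randrange(len(parents))
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(parents[current][start:cut])
        # The current parent always qualifies, so `allowed` is never empty.
        allowed = [i for i, p in enumerate(parents)
                   if shares_identity(parents[current], p, cut, window)]
        current = rng.choice(allowed)
        start = cut
    pieces.append(parents[current][start:])
    return "".join(pieces)


toy_parents = ["ACGTACGT", "ACGTTCGT"]  # hypothetical aligned parents
toy_rng = random.Random(0)
toy_hybrid = shuffle_once(toy_parents, 0.3, 2, toy_rng)
```

Repeated calls with different random seeds would yield a sample of the library that a given
cut probability and identity criterion can produce.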

FIG. 14 is a flow chart of an exemplary algorithm for directed evolution experiments.

FIG. 15 shows a quantitative comparison of the energy (x-axis) and distance (y-axis)
based calculations of crossover disruption for transformylase. An energy cutoff of 0.2
kcal/mol and a distance cutoff of 4.0 angstroms were used. The data fit a linear correlation
with R² = 0.91.

FIG. 16 shows a comparison of crossover disruption calculations for transformylase
based on the distance (top) and energy (bottom) definitions of coupling. An energy cutoff
of 0.2 kcal/mol and a distance cutoff of 4.0 angstroms were used. The qualitative shapes of
both plots are similar.

FIG. 17 shows the crossover disruption of inserted phytase domains. The distance
cutoff was set to 3.0 angstroms and the crossover disruption was normalized according
to Equation (3). The experimental parameters are as reported by Lehmann and co-workers
(2001).

FIG. 18 is a schematic of the hierarchical process of protein folding. First, the
unfolded polypeptide rapidly collapses ("bursts") into substructures. Next, the substructures
condense to form the tertiary structure of the native protein. It is undesirable for crossovers
to disrupt compact units that nucleate the remaining structure ("building blocks" or
"schema").

FIG. 19 is a schematic demonstrating the utility of a contact map in identifying
compact units of substructure. A representative contact map is on the left. The graph on the
right is a statistical study of the average length of contiguous residues that can fold into a
sphere of the indicated diameter (Gilbert 1998). This information can be used in the
following way. If a 15-residue segment can fold into a sphere with a diameter of 2[illegible]
angstroms, then this segment could be considered as being of average compactness.
However, if a 20-residue segment can fold into a sphere of 21 angstroms, this is considered
[illegible] formed by the cut point [illegible] the segment fits into a sphere of the specified
diameter then the triangle will be entirely white (interacting).
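
As an illustration of how a contact map can flag compact segments, the following sketch
builds a distance-based contact map from residue coordinates and tests whether a
contiguous segment is entirely "white". It is a sketch under stated assumptions: one
coordinate per residue is supplied by the caller, and fitting into a sphere of a given
diameter is approximated by requiring every pairwise distance in the segment to be at most
that diameter, which is a necessary but not a sufficient condition. The names contact_map
and segment_is_compact are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_map(coords: Sequence[Coord], diameter: float) -> List[List[bool]]:
    """Return a square map where entry [i][j] is True if residues i and j lie
    within `diameter` of each other."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) <= diameter for j in range(n)]
            for i in range(n)]


def segment_is_compact(cmap: List[List[bool]], start: int, end: int) -> bool:
    """True if every residue pair in the segment [start, end) is marked as a contact,
    i.e. the triangle of the map covering that segment contains no 'black' entry."""
    return all(cmap[i][j] for i in range(start, end) for j in range(i + 1, end))


# Hypothetical toy coordinates (angstroms) placed along a line.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords, 21.0)
```

With these toy coordinates the first three residues form a compact segment at a diameter of
21 angstroms, while extending the segment to the fourth residue does not.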

FIG. 20 is a comparison of (A) the Go-algorithm (using a diameter size of 21
angstroms) with (B) the 1D crossover disruption profile of transformylase. The Go-algorithm
predicts that there are three domain-forming regions in the structure whereas the 1D
crossover disruption profile (threshold energy of 0.2 kcal/mol) demonstrates that one of these
domain-forming [illegible] because [illegible].

FIG. 21 is a distance-based contact map of beta-lactamase [illegible]. Black
regions indicate residues that are further than 21 angstroms apart and white regions indicate
residues that are closer than 21 angstroms. The lines indicate the approximate locations of
crossovers observed experimentally by Crameri et al. (1998).

FIG. 22 provides an analytical description of Go's algorithm for determining
domains based on the contact map. The domain diameter is 21 for these calculations and
Equation (8) is used to determine the domain-forming ability of each residue. Low regions
in this graph indicate suitable places for domain boundaries. The thick black horizontal lines
indicate the approximate domain boundaries identified by this method and the thin vertical
lines demarcate the regions where crossovers were observed experimentally by Crameri et
al. (1998). The domain algorithm identifies some of the general structure of where the
crossover occurs, but makes a poor prediction overall.

FIG. 23 shows an algorithm that combines the concept of disrupting a domain with
the concept of disrupting coupling interactions. First, all fragments of a specified size and
greater are identified in the structure. Next, the fragments that fold into a sphere of the
specified diameter are coupled to the remainder of the structure above a threshold disruption
value and separated. Finally, the schema disruption value of all the residues involved in the
interacting compact unit are incremented, indicating that crossovers that occur in this region
will disrupt a "building block" and therefore be destabilizing.






FIG. 24 shows the schema disruption profile as determined from the transformylase
structure. (A) No sequence identity was considered (both P terms set to 1 in Equation 3).
The parameters are [illegible] = 4.0 and [illegible] = 4.0. (B) Sequence identity is
considered (Equation 3). The parameters are [illegible] = 4.0 and [illegible]. The
normalization of crossover disruption in both graphs was according to Equation (6).

FIG. 25 shows the schema disruption profile as determined from the beta-lactamase
structure compared with the experimentally observed crossover points (thick, horizontal bars)
(Crameri et al., 1998). (A) The profile as determined from the domain algorithm alone, with
21 angstroms and 15 residues (Equation 9). (B) The profile with disruptive
domains removed where the crossover disruption was normalized as in Equation (3). The
crossover disruption threshold was set to be [illegible] (corresponding to a Z-score of
0.1). No sequence identity was considered (both P terms set to 1 in Equation 3). (C) The
profile with disruptive domains removed where the crossover disruption was normalized as in
Equation (6). The crossover disruption threshold was set to be [illegible] (corresponding to
a Z-score of 0.4). No sequence identity was considered (both P terms set to 1 in Equation 3).
(D) The same profile as in (C), except sequence identity is considered (Equation 3). The
crossover disruption threshold was set to be 0.6 (corresponding to a Z-score of 0.2).

FIG. 26A shows a schema disruption calculation of the P450 2C5 structure.
Equation (10) was used to generate the graph and the crossover disruption normalization
scheme of Equation (3) was used. The parameters for this calculation are a distance cutoff
of 4.0 and a crossover disruption threshold of 0.005 (corresponding to a Z-score of 0.3).
The red lines indicate where experimentally generated single cut point recombination events
led to folded chimeras [illegible]. The arrow indicates the location of the crossover that
resulted in a folded P450cam-P450 2C9 chimera (Shimoji et al., 1998). Note that not all of
the residues were resolved in the structure, so the numbering starts at 30 (e.g., residue 1 in
the graph is residue 30) and residues 212-222 are missing. FIG. 26B shows a schema
disruption calculation of the P450cam structure. Equation (10) was used to generate the
graph and the crossover disruption normalization scheme of Equation (3) was used. The
parameters for this calculation are a distance cutoff of 4.0 and a crossover disruption
threshold of [illegible] (corresponding to a Z-score of 0.65). The red line indicates the
location of the crossover that resulted in a folded P450cam-P450 2C9 chimera
(Shimoji et al., 1998). Note that not all residues were resolved in the structure; residue 1 in
the graph is residue 7 in the structure. No sequence identity was considered for either P450
calculation (both P terms set to 1 in Equation 3).

FIGS. 27A and 27B illustrate a method for determining optimal parents for crossover
recombination by analyzing the schema disruption for a DNA shuffling
experiment with beta-lactamase (Crameri et al., 1998). The parents in this example are: (1)
Enterobacter cloacae, (2) Citrobacter freundii, (3) Yersinia enterocolitica, and (4) Klebsiella
pneumoniae.

5. DETAILED DESCRIPTION OF THE INVENTION
The invention overcomes problems in the prior art and provides novel methods
which can be used for directed evolution of biopolymers such as proteins and nucleic acids.
In particular, the invention provides methods which can be used to identify candidate
locations in a biopolymer for crossovers, such that the biopolymer (e.g., polypeptide) will
likely retain stability and functionality while allowing crossovers to occur. By generating
hybrids that are recombined at selected candidate crossover locations or cut points, mutant
or hybrid polymers having one or more improved properties may be more readily identified
while simultaneously reducing the number(s) of mutants screened.

Details of the invention are described below, including specific examples. These
examples are provided to illustrate embodiments of the invention. However, the invention
is not limited to the particular embodiments, and many modifications and variations of the
invention will be apparent to those skilled in the art. Such modifications and variations are
also part of the invention.






5.1 Definitions

The terms used in this specification generally have their ordinary meanings in the art,
within the context of this invention and in the specific context where each term is used.
Certain terms are discussed below, or elsewhere in the specification, to provide additional
guidance to the practitioner in describing the compositions and methods of the invention and
how to make and how to use them. The scope and meaning of any use of a term will be
apparent from the specific context in which the term is used.
Molecular Biology

The term "molecule" means any distinct or distinguishable structural unit of matter
comprising one or more atoms, and includes, for example, polypeptides and polynucleotides.

The term "polymer" means any substance or compound that is composed of two or
more building blocks ('mers') that are repetitively linked together. For example, a "dimer"
is a compound in which two building blocks have been joined together; a "trimer" is a
compound in which three building blocks have been joined together; etc.
A "biopolymer" is any polymer having an organic or biochemical utility or that is
produced by a cell. Preferred biopolymers include, but are not limited to, polynucleotides,
polypeptides and polysaccharides.

The term "polynucleotide" or "nucleic acid molecule" refers to a polymeric molecule
having a backbone that supports bases capable of hydrogen bonding to typical
polynucleotides, wherein the polymer backbone presents the bases in a manner to permit
such hydrogen bonding in a specific fashion between the polymeric molecule and a typical
polynucleotide (e.g., single-stranded DNA). Such bases are typically inosine, adenosine,
guanosine, cytosine, uracil and thymidine. Polymeric molecules include "double stranded"
and "single stranded" DNA and RNA, as well as backbone modifications thereof (for
example, methylphosphonate linkages).

Thus, a "polynucleotide" or "nucleic acid" sequence is a series of nucleotide bases
(also called "nucleotides"), generally in DNA and RNA, and means any chain of two or more
nucleotides. A nucleotide sequence frequently carries genetic information, including the
information used by cellular machinery to make proteins and enzymes. The terms include
genomic DNA, cDNA, RNA, any synthetic and genetically manipulated polynucleotide, and
both sense and antisense polynucleotides. This includes single- and double-stranded
molecules; i.e., DNA-DNA, DNA-RNA, and RNA-RNA hybrids as well as "protein nucleic
acids" (PNA) formed by conjugating bases to an amino acid backbone. This also includes
nucleic acids containing modified bases, for example, thio-uracil, thio-guanine and
fluoro-uracil.

The polynucleotides herein may be flanked by natural regulatory sequences, or may
be associated with [illegible] response elements, signal sequences, [illegible] coding regions
and the like. The nucleic acids [illegible].
Non-limiting examples of such modifications include methylation, "caps", substitution of one
or more of the naturally occurring nucleotides with an analog, and internucleotide
modifications such as, for example, those with uncharged linkages (e.g., methyl
phosphonates, phosphotriesters, phosphoroamidates, carbamates, etc.) and with charged
linkages (e.g., phosphorothioates, phosphorodithioates, etc.). Polynucleotides may contain
one or more additional covalently linked moieties, such as proteins (e.g., nucleases, toxins,
antibodies, signal peptides, poly-L-lysine, etc.), intercalators (e.g., acridine, psoralen, etc.),
chelators (e.g., metals, radioactive metals, iron, oxidative metals, etc.) and alkylators to name
a few. The polynucleotides may be derivatized by formation of a methyl or ethyl
phosphotriester or an alkyl phosphoramidite linkage. Furthermore, the polynucleotides
herein may also be modified with a label capable of providing a detectable signal, either
directly or indirectly. Exemplary labels include radioisotopes, fluorescent molecules, biotin
and the like. Other non-limiting examples of modifications are given
below, in the description of the invention.

The term "oligonucleotide" refers to a nucleic acid, generally of at least 10, preferably
at least 15, and more preferably at least 20 nucleotides, preferably no more than 100
nucleotides, that is hybridizable to a genomic DNA molecule, a cDNA molecule, or an
[illegible]. Oligonucleotides can be labeled, e.g., with 32P-nucleotides or nucleotides to
which a label, such as biotin or a fluorescent dye (for example, Cy3 or Cy5), has been
covalently conjugated. In one embodiment, an oligonucleotide can be used as a PCR primer.
Oligonucleotides therefore have many practical uses that are well known in the art. For
example, a labeled oligonucleotide can be used as a probe to detect the presence of a nucleic
acid. Generally, oligonucleotides are prepared synthetically, preferably on a nucleic acid
synthesizer. Accordingly, oligonucleotides can be prepared with non-naturally occurring
phosphoester analog bonds, such as thioester bonds, etc.

A "polypeptide" is a chain of chemical building blocks called amino acids that are
linked together by chemical bonds called "peptide bonds". The term "protein" refers to
polypeptides that contain the amino acid residues encoded by a gene or by a nucleic acid
molecule (e.g., an mRNA or a cDNA) transcribed from that gene either directly or indirectly.
Optionally, a protein may lack certain amino acid residues that are encoded by a gene or by
an mRNA. For example, a gene or mRNA molecule may encode a sequence of amino acid
residues on the N-terminus of a protein (i.e., a signal sequence) that is cleaved from, and
therefore may not be part of, the final protein. A protein or polypeptide, including an
enzyme, may be "native" or "wild-type", meaning that it occurs in nature; or it may be a
"mutant", "variant" or "modified", meaning that it has been made, altered, derived, or is in
some way different or changed from a native protein or from another mutant.

"Amplification" of a polynucleotide denotes the use of polymerase chain reaction
(PCR) to increase the concentration of a particular DNA sequence within a mixture of DNA
sequences. For a description of PCR see Saiki et al., Science 1988, 239:487.

A "gene" is a sequence of nucleotides which code for a functional "gene product".
Generally, a gene product is a functional protein. However, a gene product can also be
another type of molecule in a cell, such as an RNA (e.g., a tRNA or a rRNA). For the
purposes of [illegible] an mRNA sequence which may be
found in a cell. For example, measuring gene expression levels according to the invention
may correspond to measuring mRNA levels. A gene may also comprise regulatory (i.e.,
non-coding) sequences as well as coding sequences. Exemplary regulatory sequences
include promoter sequences, which determine, for example, the conditions under which the
gene is expressed. The transcribed region of the gene may also include untranslated regions
including introns, a 5'-untranslated region (5'-UTR) and a 3'-untranslated region (3'-UTR).
A "coding sequence" or a sequence "encoding" an expression product, such as an
RNA, polypeptide, protein or enzyme, is a nucleotide sequence that, when expressed, results
in the production of that RNA, polypeptide, protein or enzyme; i.e., the nucleotide sequence
"encodes" that RNA or it encodes the amino acid sequence for that polypeptide, protein or
enzyme.

A "promoter sequence" is a DNA regulatory region capable of binding RNA
polymerase in a cell and initiating transcription of a downstream (3' direction) coding
sequence. A promoter sequence is typically bounded at its 3' terminus by the transcription
initiation site and extends upstream (5' direction) to include the minimum number of bases
or elements necessary to initiate transcription at levels detectable above background. Within
the promoter sequence will be found a transcription initiation site (conveniently found, for
example, by mapping with nuclease S1), as well as protein binding domains (consensus
sequences) responsible for the binding of RNA polymerase.

A coding sequence is "under the control of" or is "operatively associated with"
transcriptional and translational control sequences in a cell when RNA polymerase
transcribes the coding sequence into RNA, which is then trans-RNA spliced (if it contains
introns) and, if the sequence encodes a protein, is translated into that protein.

The terms "express" and "expression" mean allowing or causing the information in
a gene or DNA sequence to become manifest, for example producing RNA (such as tRNA
or mRNA) or a protein by activating the cellular functions involved in transcription and
translation of a corresponding gene or DNA sequence. A DNA sequence is expressed by a
cell to form an "expression product" such as an RNA (e.g., a mRNA or a rRNA) or a protein.
The expression product itself, e.g., the resulting RNA or protein, may also be said to be
"expressed" by the cell.

The term "transfection" means the introduction of a foreign nucleic acid into a cell.
The term "transformation" means the introduction of a "foreign" (i.e., extrinsic or
extracellular) gene, DNA or RNA sequence into a host cell so that the host cell will express
the introduced gene or sequence to produce a desired substance, in this invention typically
an RNA coded by the introduced gene or sequence, but also a protein or an enzyme coded
by the introduced gene or sequence. The introduced gene or sequence may also be called a
"cloned" or "foreign" gene or sequence, and may include regulatory or control sequences
(e.g., start, stop, promoter, signal, secretion or other sequences used by a cell's genetic
machinery). The gene or sequence may include nonfunctional sequences or sequences with
no known function. A host cell that receives and expresses introduced DNA or RNA has been
"transformed" and is a "transformant" or a "clone". The DNA or RNA introduced to a host
cell can come from any source, including cells of the same genus or species as the host cell
or cells of a different genus or species.

The terms "vector", "cloning vector" and "expression vector" mean the vehicle by
which a DNA or RNA sequence (e.g., a foreign gene) can be introduced into a host cell so
as to transform the host and promote expression (e.g., transcription and translation) of the
introduced sequence. Vectors may include plasmids, phages, viruses, etc. and are discussed
in greater detail below.

A "cassette" refers to a DNA coding sequence or segment of DNA that codes for an
expression product that can be inserted into a vector at defined restriction sites. The cassette
restriction sites are designed to ensure insertion of the cassette in the proper reading frame.
Generally, foreign DNA is inserted at one or more restriction sites of the vector DNA, and
then is carried by the vector into a host cell along with the transmissible vector DNA. A
segment or sequence of DNA having inserted or added DNA, such as an expression vector,
can also be called a "DNA construct". A common type of vector is a "plasmid", which
generally is a self-contained molecule of double-stranded DNA, usually of bacterial origin,
that can readily accept additional (foreign) DNA and which can readily be introduced into a
suitable host cell. A large number of vectors, including plasmid and fungal vectors, have
been described for replication and/or expression in a variety of eukaryotic and prokaryotic
hosts.

The term "host cell" means any cell of any organism that is selected, modified,
transformed, grown or used or manipulated in any way for the production of a substance by
the cell. For example, a host cell may be one that is manipulated to express a particular
gene, a DNA or RNA sequence, a protein or an enzyme. Host cells can further be used for
screening or other assays that are described infra. Host cells may be cultured in vitro or be
one or more cells in a non-human animal (e.g., a transgenic animal or a transiently
transfected animal).

The term "expression system" means a host cell and compatible vector under suitable
conditions, e.g. for the expression of a protein coded for by foreign DNA carried by the
vector and introduced to the host cell. Common expression systems include E. coli host cells
and plasmid vectors, insect host cells such as Sf9, Hi5 or S2 cells and Baculovirus vectors,
Drosophila cells (Schneider cells) and expression systems, fish cells and expression systems
(including, for example, RTH-149 cells from rainbow trout, which are available from the
American Type Culture Collection and have been assigned the accession no. CRL-1710) and
mammalian host cells and vectors.

The terms "mutant" and "mutation" mean any change in a particular polymer
sequence (also sometimes referred to herein as a "parent sequence"). Mutations may include,
but are not limited to, changes in the nucleotide sequence of a nucleic acid (including
changes in the sequence of a gene), and also changes in the amino acid sequence of a protein
or polypeptide. Thus, in the invention these terms may refer to a difference of even one
residue (e.g., one nucleic or amino acid), but more typically refer to recombined sequences
that are substantially different from their parents. That is, a "mutant" includes the offspring
of recombined parent sequences, as by combining (for example) genetic material from two
parent genes. A mutant may also be referred to as a "hybrid" or a "variant."






The term "chimera" is synonymous with "recombinant mutant" and refers to an
offspring gene which contains genetic material from one or more parents.

The methods of the invention may include steps of comparing parent sequences to
each other or a parent sequence to one or more mutants. Such comparisons typically
comprise alignments of polymer sequences, e.g., using sequence alignment programs and/or
algorithms that are well known in the art (for example, BLAST, FASTA and MEGALIGN,
to name a few). The skilled artisan can readily appreciate that, in such alignments, where a
mutation contains a residue insertion or deletion, the sequence alignment will introduce a
"gap" (typically represented by a dash, or "Δ") in the polymer sequence not containing
the inserted or deleted residue. Thus, for example, in an embodiment where a mutation
introduces a single amino acid deletion in a parent sequence at amino acid residue i, an
alignment of the parent and mutant polypeptide sequences will introduce a gap in the mutant
sequence that aligns with amino acid residue i of the parent. In such embodiments, therefore,
amino acid residue i in the mutant sequence is preferably said to be a "gap" or "deletion".
The term "heterologous" refers to a combination of elements not naturally occurring.
For example, chimeric RNA molecules may comprise an rRNA sequence and a heterologous
RNA sequence which is not part of the rRNA sequence. In this context, the heterologous
RNA sequence refers to an RNA sequence that is not naturally located within the ribosomal
RNA sequence. Alternatively, the heterologous RNA sequence may be naturally located
within the ribosomal RNA sequence, but is found at a location in the rRNA sequence where
it does not naturally occur. As another example, heterologous DNA refers to DNA that is
not naturally located in the cell, or in a chromosomal site of the cell. Preferably,
heterologous DNA includes a gene foreign to the cell. A heterologous expression regulatory
element is a regulatory element operatively associated with a different gene than the one it
is operatively associated with in nature.

The term "homologous" refers to the relationship between two biopolymers (e.g.
polypeptides or oligonucleotides) that possess a common evolutionary origin. This includes,
without limitation, proteins from superfamilies (e.g., the immunoglobulin superfamily) in the
[...] (for example, myosin light chain polypeptide, etc.; see, Reeck et al., Cell 1987, 50:667).
Such proteins (and their encoding nucleic acids) have sequence homology, as reflected by
their sequence similarity, or regions of sequence similarity, however expressed. For
example, "homology" can be expressed as sequence similarity in terms of percent sequence
identity or by the presence of specific residues or motifs and conserved positions.

The terms "sequence similarity" and "sequence identity", in all their grammatical
forms, refer to the degree of identity or correspondence between nucleic acid or amino acid
sequences that may or may not share a common evolutionary origin (see, Reeck et al.,
supra). However, in common usage and in the instant application, the term "homologous",
particularly when modified with an adverb such as "highly", may refer to sequence similarity
and may or may not relate to a common evolutionary origin.
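
As an illustration of how sequence identity may be expressed as a percentage, the following minimal sketch counts identical residues over the aligned, gap-free columns of two pre-aligned sequences. The choice of denominator (gap-free columns only) is an assumption made here for illustration; other conventions exist, and the specification does not prescribe one. The function name percent_identity is hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent of gap-free aligned columns in which both sequences carry the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # keep only columns where neither sequence has a gap
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Hypothetical aligned pair: 7 gap-free columns, 6 of them identical.
identity = percent_identity("MKTAYIAK", "MKT-YLAK")
```

A threshold on such a value is one way the "given threshold" or "given minimum" of sequence identity referred to in this specification could be applied.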

The term "recombination" and variant spellings thereof, encompasses both
"homologous" and "non-homologous" recombination. In its most basic form, recombination
is the exchange of biopolymer fragments between two biopolymer sequences. As defined
in this invention, sequences may be recombined at the amino acid or nucleic acid level.

The term "homologous recombination" refers to the exchange of biopolymer
fragments between two or more biopolymer sequences at locations where the sequences
exhibit regions of sequence homology. In more general biological terms, recombination
refers to the insertion of a modified or foreign DNA sequence contained by a first vector into
another DNA sequence contained in a second vector, or a chromosome of a cell. The first
vector targets a specific chromosomal site for homologous recombination. For homologous
recombination, the first vector will contain sufficiently long regions of homology to sequences
of the second vector or chromosome to allow complementary binding and incorporation of
DNA from the first vector into the DNA of the second vector, or the chromosome.

According to the invention, the sequence similarity of biopolymers being recombined
can be high, low, or none, and indeed can range from less than 50% (e.g., 0%) to as high as
100%. Where parent sequences are homologous, i.e. have some threshold of sequence
identity, alignments may be used to aid in the selection of cut points and fragments for
recombination. Alignments are also used for certain recombination protocols, such as DNA
shuffling, which can be modeled according to the invention. However, other recombinations
do not require alignments, such as the ITCHY protocol, and these also can be modeled to
calculate a schema disruption profile. A model of non-homologous (non-sequence identity)
recombination is illustrated by FIG. 1A and FIG. 5, discussed infra. Crossovers can be
calculated for 0% sequence identity, as long as the parents fold into the same (or similar)
structures. Cut points are determined as in FIG. 2, which does not require or imply sequence
identity.

The term "non-homologous recombination" refers to the exchange of biopolymer
fragments between two biopolymer sequences that are not homologous, or that do not share
sequence identity, for example according to a given threshold. As used herein,
non-homologous biopolymers, like homologous biopolymers, may or may not have a common
evolutionary origin, and in preferred embodiments they do have a common evolutionary
origin. However, non-homologous biopolymers, unlike homologous biopolymers, have no
sequence identity, or the sequence identity (if any) is less than a given minimum.

In certain embodiments of the invention, biopolymers or fragments thereof may be
selected for recombination based on any suitable energy or structural data, not necessarily
homology or sequence identity. For example, cut points or schema may be selected based
on structural input such as interatomic distances, without regard for sequence identity. That
is, the biopolymers may or may not have any, or a given degree, of sequence identity.
Optimal schema (and fragments) can be determined from this data without regard for the
recombination or shuffling protocol. In addition, alignment data from homologous
sequences or regions, if any, can be used as additional structural input to further refine the
selected schema and optimal fragments for recombination.

A nucleic acid molecule is "hybridizable" to another nucleic acid molecule, such as
a cDNA, genomic DNA, or RNA, when a single stranded form of the nucleic acid molecule
can anneal to the other nucleic acid molecule under the appropriate conditions of temperature
and solution ionic strength (see Sambrook et al., supra). The conditions of temperature and
ionic strength determine the "stringency" of the hybridization. For preliminary screening for
homologous nucleic acids, low stringency hybridization conditions, corresponding to a
Tm (melting temperature) of 55°C, can be used (e.g., 5x SSC, 0.1% SDS, 0.25% milk, and no
formamide; or 30% formamide, 5x SSC, 0.5% SDS). Moderate stringency hybridization
conditions correspond to a higher Tm, e.g., 40% formamide, with 5x or 6x SSC. High
stringency hybridization conditions correspond to the highest Tm, e.g., 50% formamide,
5x or 6x SSC. SSC is 0.15 M NaCl, 0.015 M Na-citrate. Hybridization requires that the two
nucleic acids contain complementary sequences, although depending on the stringency of the
hybridization, mismatches between bases are possible. The appropriate stringency for
hybridizing nucleic acids depends on the length of the nucleic acids and the degree of
complementation, variables well known in the art. The greater the degree of similarity or
homology between two nucleotide sequences, the greater the value of Tm for hybrids of
nucleic acids having those sequences. The relative stability (corresponding to higher Tm)
of nucleic acid hybridizations decreases in the following order: RNA:RNA, DNA:RNA,
DNA:DNA. For hybrids of greater than 100 nucleotides in length, equations for calculating
Tm have been derived (see Sambrook et al., supra, 9.50-9.51). For hybridization with
shorter nucleic acids, i.e., oligonucleotides, the position of mismatches becomes more
important, and the length of the oligonucleotide determines its specificity (see Sambrook et
al., supra, 11.7-11.8). A minimum length for a hybridizable nucleic acid is at least about 10
nucleotides; preferably at least about 15 nucleotides; and more preferably the length is at
least about 20 nucleotides.

Unless specified, the term "standard hybridization conditions" refers to a Tm of about
55°C, and utilizes conditions as set forth above. In a preferred embodiment, the Tm is 60°C;
in a more preferred embodiment, the Tm is 65°C. In a specific embodiment, "high
stringency" refers to hybridization and/or washing conditions at 68°C in 0.2X SSC, at 42°C
in 50% formamide, 4X SSC, or under conditions that afford levels of hybridization equivalent
to those observed under either of these two conditions.






Suitable hybridization conditions for oligonucleotides (e.g., for oligonucleotide
probes or primers) are typically somewhat different than for full-length nucleic acids (e.g.,
full-length cDNA), because of the oligonucleotides' lower melting temperature. Because the
melting temperature of oligonucleotides will depend on the length of the oligonucleotide
sequences involved, suitable hybridization temperatures will vary depending upon the
oligonucleotide molecules used. Exemplary temperatures may be 37°C (for 14-base
oligonucleotides), 48°C (for 17-base oligonucleotides), 55°C (for 20-base oligonucleotides)
and 60°C (for 23-base oligonucleotides). Exemplary suitable hybridization conditions for
oligonucleotides include washing in 6x SSC/0.05% sodium pyrophosphate, or other
conditions that afford equivalent levels of hybridization.

The term "isolated" means that the referenced material is removed from the
environment in which it is normally found. Thus, an isolated biological material can be free
of cellular components, i.e., components of the cells in which the material is found or
produced. In the case of nucleic acid molecules, an isolated nucleic acid includes a PCR
product, an isolated mRNA, a cDNA, or a restriction fragment. In another embodiment, an
isolated nucleic acid is preferably excised from the chromosome in which it may be found,
and more preferably is no longer joined to non-regulatory, non-coding regions, or to other
genes, located upstream or downstream of the gene contained by the isolated nucleic acid
molecule when found in the chromosome. In yet another embodiment, the isolated nucleic
acid lacks one or more introns. Isolated nucleic acid molecules include sequences inserted
into plasmids, cosmids, artificial chromosomes, and the like. Thus, in a specific
embodiment, a recombinant nucleic acid is an isolated nucleic acid. An isolated protein may
be associated with other proteins or nucleic acids, or both, with which it associates in the
cell, or with cellular membranes if it is a membrane-associated protein. An isolated
organelle, cell, or tissue is removed from the anatomical site in which it is found in an
organism. An isolated material may be, but need not be, purified.

The term "purified" refers to material that has been isolated under conditions that
reduce or eliminate the presence of unrelated materials, i.e., contaminants, including native
materials from which the material is obtained. For example, a purified protein is preferably
substantially free of other proteins or nucleic acids with which it is associated in a cell; a
purified nucleic acid molecule is preferably substantially free of proteins or other unrelated
nucleic acid molecules with which it can be found within a cell. The term "substantially
free" is used operationally, in the context of analytical testing of the material. Preferably,
purified material substantially free of contaminants is at least 50% pure; more preferably, at
least 90% pure, and more preferably still at least 99% pure. Purity can be evaluated by
chromatography, gel electrophoresis, immunoassay, composition analysis, biological assay,
and other methods known in the art.

Methods for purification are well-known in the art. For example, nucleic acids can
be purified by precipitation, chromatography (including preparative solid phase
chromatography, oligonucleotide hybridization, and triple helix chromatography),
ultracentrifugation, and other means. Polypeptides and proteins can be purified by various
methods including, without limitation, preparative disc-gel electrophoresis, isoelectric
focusing, HPLC, reversed-phase HPLC, gel filtration, ion exchange and partition
chromatography, precipitation and salting-out chromatography, extraction, and
countercurrent distribution. For some purposes, it is preferable to produce the polypeptide
in a recombinant system in which the protein contains an additional sequence tag that
facilitates purification, such as, but not limited to, a polyhistidine sequence, or a sequence
that specifically binds to an antibody, such as FLAG and GST. The polypeptide can then be
purified from a crude lysate of the host cell by chromatography on an appropriate solid-phase
matrix. Alternatively, antibodies produced against the protein or against peptides derived
therefrom can be used as purification reagents. Cells can be purified by various techniques,
including centrifugation, matrix separation (e.g., nylon wool separation), panning and other
immunoselection techniques, depletion (e.g., complement depletion of contaminating cells),
and cell sorting (e.g., fluorescence activated cell sorting or FACS). Other purification
methods are possible. A purified material may contain less than about 50%, preferably less
than about 75%, and most preferably less than about 90%, of the cellular components with
which it was originally associated. The term "substantially pure" indicates the highest degree of
purity which can be achieved using conventional purification techniques known in the art.

In preferred embodiments, the terms "about" and "approximately" shall generally
mean an acceptable degree of error for the quantity measured given the nature or precision
of the measurements. Typical, exemplary degrees of error are within 20 percent (%),
preferably within 10%, and more preferably within 5% of a given value or range of values.
Alternatively, and particularly in biological systems, the terms "about" and "approximately"
may mean values that are within an order of magnitude, preferably within 5-fold and more
preferably within 2-fold of a given value. Numerical quantities given herein are approximate
unless stated otherwise, meaning that the term "about" or "approximately" can be inferred
when not expressly stated.

Molecular Physics

The term "sequence space" refers to the set of all possible sequences of residues for
a polymer having a specified length. Thus, for example, the sequence space for a protein or
polypeptide 300 amino acid residues in length is the group consisting of all sequences of 300
amino acid residues, e.g. 20^300 ≈ 10^390 sequences of 300 amino acids. Similarly, the
sequence space of a nucleic acid 300 nucleotides in length is the group consisting of all
sequences of 300 nucleotides, etc.
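
The size of a sequence space follows directly from the alphabet size and the polymer length. The short sketch below works in base-10 logarithms so that the result can be read as an exponent; the function name is hypothetical, and alphabet sizes of 20 amino acids and 4 nucleotides are assumed.

```python
import math


def log10_sequence_space(alphabet_size: int, length: int) -> float:
    """Base-10 logarithm of alphabet_size ** length, the number of possible sequences."""
    return length * math.log10(alphabet_size)


# 300-residue proteins over 20 amino acids: exponent of roughly 390
protein_exponent = log10_sequence_space(20, 300)
# 300-nucleotide nucleic acids over 4 bases: exponent of roughly 181
nucleic_exponent = log10_sequence_space(4, 300)
```

Working with the exponent rather than the number itself keeps the illustration readable, since the full integer for 20^300 has 391 digits.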

"Conformational energy" refers generally to the energy associated with a particular
"conformation", or three-dimensional structure, of a polymer, such as the energy associated
with the conformation of a particular protein or nucleic acid. Interactions that tend to
stabilize a macromolecule such as a polymer (e.g., a protein or nucleic acid) have energies
that are quantitatively represented in this specification as negative energy values, whereas
interactions that destabilize a polymer have positive energy values. Thus, the conformational
energy for any stable polymer is quantitatively represented by a negative conformational
energy value. Generally, the conformational energy for a particular polymer will be related
to that polymer's stability. In particular, polymers and other macromolecules that have a
lower (i.e., more negative) conformational energy are typically more stable, e.g., at higher
temperatures (i.e., they have greater "thermal stability"). Accordingly, the conformational
energy of a polymer may also be referred to as the polymer's "stabilization energy".

Typically, the conformational energy is calculated using an energy "force-field" that
calculates or estimates the energy contribution from various interactions which depend upon
the conformation of a polymer. The force-field is comprised of terms that include the
conformational energy of the alpha-carbon backbone, side chain - backbone interactions, and
side chain - side chain interactions. Typically, interactions with the backbone or side chain
include terms for bond rotation, bond torsion, and bond length. The backbone-side chain and
side chain-side chain interactions include van der Waals interactions, hydrogen-bonding,
electrostatics and solvation terms. Electrostatic interactions may include coulombic
interactions, dipole interactions and quadrupole interactions. Other similar terms may also
be included. Force-fields that may be used to determine the conformational energy for a
polymer are well known in the art and include the CHARMM (see, Brooks et al., J. Comp.
Chem. 1983, 4:187-217; MacKerell et al., in The Encyclopedia of Computational Chemistry,
Vol. 1:271-277, John Wiley & Sons, Chichester, 1998), AMBER (see, Cornell et al., J.
Amer. Chem. Soc. 1995, 117:5179; Woods et al., J. Phys. Chem. 1995, 99:3832-3846;
Weiner et al., J. Comp. Chem. 1986, 7:230; and Weiner et al., J. Amer. Chem. Soc. 1984,
106:765) and DREIDING (Mayo et al., J. Phys. Chem. 1990, 94:8897) force-fields, to name
a few.

In a preferred implementation, the hydrogen bonding and electrostatics terms are as
described in Dahiyat & Mayo, Science 1997 278:82. The force field can also be described
to include atomic conformational terms (bond angles, bond lengths, torsions), as in other
references. See e.g., Nielsen JE, Andersen KV, Honig B, Hooft RWW, Klebe G, Vriend G
& Wade RC, "Improving macromolecular electrostatics calculations," Protein Engineering,
12: 657-662 (1999); Sitkoff D, Lockhart DJ, Sharp KA & Honig B, "Calculation of
electrostatic effects at the amino terminus of an alpha-helix," Biophys. J., 67: 2251-2260
(1994); Hendsch ZS & Tidor B, "Do salt bridges stabilize proteins? A continuum electrostatic
analysis," Protein Science, 3: 211-226 (1994); Schneider JP, Lear JD & DeGrado WF, "A
designed buried salt bridge in a heterodimeric coiled coil," J. Am. Chem. Soc., 119: 5742-5743
(1997); Sindelar CV, Hendsch ZS, Tidor B, "Effects of salt bridges on protein structure and
design," Protein Science, 7: 1898-1914 (1998). Solvation terms could also be included. See
e.g., Jackson SE, Moracci M, elMasry N, Johnson CM, Fersht AR, "Effect of Cavity-Creating
Mutations in the Hydrophobic Core of Chymotrypsin Inhibitor 2," Biochemistry,
32: 11259-11269 (1993); Eisenberg D & McLachlan AD, "Solvation Energy in Protein
Folding and Binding," Nature, 319: 199-203 (1986); Street AG & Mayo SL, "Pairwise
Calculation of Protein Solvent-Accessible Surface Areas," Folding & Design, 3: 253-258
(1998); Eisenberg D & Wesson L, "Atomic solvation parameters applied to molecular
dynamics of proteins in solution," Protein Science, 1: 227-235 (1992); Gordon & Mayo,

"Coupled residues" are residues in a polymer that interact, through any mechanism.
The interaction between the two residues is therefore referred to as a "coupling interaction".
Coupled residues generally contribute to polymer fitness through the coupling interaction.
Typically, the coupling interaction is a physical or chemical interaction, such as an
electrostatic interaction, a van der Waals interaction, a hydrogen bonding interaction, or a
combination thereof. As a result of the coupling interaction, changing the identity of either
residue will affect the fitness of the polymer, particularly if the change disrupts the coupling
interaction between the two residues. Coupling interactions may also preferably be described
by a distance parameter between residues in a polymer. If the residues are within a certain
cutoff distance, they are considered interacting. This approach provides good results and can
be computed relatively quickly.
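
The distance-based description of coupling can be sketched as follows. The sketch assumes, purely for illustration, that each residue is represented by a single three-dimensional coordinate and that the cutoff distance is supplied by the user; the function name, the coordinates and the cutoff of 4.5 (in arbitrary length units) are hypothetical and not taken from the specification.

```python
import math
from itertools import combinations


def coupled_pairs(coords, cutoff):
    """Return the set of residue index pairs (i, j), i < j, lying within the cutoff distance."""
    pairs = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            pairs.add((i, j))
    return pairs


# Hypothetical coordinates for four residues (one representative point each).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
pairs = coupled_pairs(coords, cutoff=4.5)
```

With these coordinates only residues 0 and 1 (distance 3) and residues 1 and 3 (distance 4) are considered interacting. The pairwise loop is quadratic in the number of residues, which is consistent with the statement above that this approach can be computed relatively quickly for single proteins.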

If a coupling interaction is considered disrupted by crossover recombination, a
"crossover disruption" (E_c) parameter for each mutant can be determined. The "crossover
disruption" (E_c) of a mutant is determined by the number of disrupted coupled interactions
caused by the crossover from one sequence to another. Coupled pairwise interactions
between amino acids from different parent sequences are summed, while the interactions
within fragments and shared between fragments from the same parent are not counted.
Candidate or optimal crossover locations correspond to points of parental
recombination with minimal disruption, i.e., without disrupting
parental clusters of favorably interacting residues (building blocks, or schema) in the
parental genes.
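
A minimal sketch of the counting rule for the crossover disruption E_c described above: each coupled pair is counted once if its two residues derive from different parents, and not at all if both derive from the same parent. The representation of a mutant as a string of parent labels, the function name and the example data are assumptions introduced for illustration only.

```python
def crossover_disruption(parent_of, coupled_pairs):
    """Count coupled residue pairs whose two residues come from different parents."""
    return sum(1 for i, j in coupled_pairs if parent_of[i] != parent_of[j])


# Hypothetical 6-residue mutant: residues 0-2 from parent A, residues 3-5 from parent B.
parent_of = "AAABBB"
# Hypothetical coupled pairs, e.g. obtained from a distance cutoff on a parent structure.
coupled = [(0, 1), (2, 3), (1, 4), (4, 5)]
e_c = crossover_disruption(parent_of, coupled)
```

In this example the pairs (2, 3) and (1, 4) straddle the crossover and are counted, while (0, 1) and (4, 5) lie within a single parental fragment and are not.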

A "crossover disruption profile" is the crossover disruption that would result if a
crossover occurred at a given residue (or each residue) of a biopolymer sequence. The term
"crossover" refers to a recombination process in which an exchange of polymer sequences
occurs between two linear polymer sequences, e.g. any point at which the genetic material
from two parents is switched in an offspring.
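
A crossover disruption profile for the simplest case, a single crossover between two parents, might be sketched as below. For each possible cut position the sketch counts the coupled pairs that would be split between the two parents. The single-crossover restriction, the function name and the example pairs are illustrative assumptions rather than the procedure of the invention.

```python
def single_crossover_profile(n_residues, coupled_pairs):
    """For each cut between residues cut-1 and cut (cut = 1 .. n_residues-1),
    count coupled pairs with one residue before the cut and one at or after it."""
    profile = []
    for cut in range(1, n_residues):
        disrupted = sum(
            1 for i, j in coupled_pairs if min(i, j) < cut <= max(i, j)
        )
        profile.append(disrupted)
    return profile


# Hypothetical coupled pairs in a 6-residue sequence.
coupled = [(0, 1), (2, 3), (1, 4), (4, 5)]
profile = single_crossover_profile(6, coupled)
# Cut positions with the lowest disruption are candidate crossover locations.
best_cuts = [cut for cut, d in enumerate(profile, start=1) if d == min(profile)]
```

In this toy case a cut before residue 3 splits two coupled pairs, whereas every other cut splits only one, so the cut before residue 3 would be the least preferred crossover location.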

A "schema disruption" is the disruption of a set of residues that interact in a
collectively beneficial way. For example, it may be harmful to the recombinant mutant
sequences if the residues participating in a schema come from different parents. Schema
disruption is a combination of the disruption of independent structural elements (domains)
or structural elements that cause a breaking of coupling interactions. See e.g., Holland,
Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI
(1975).

Thus, schema are clusters of amino acids in the structure that interact in some
positive way. For example, they may interact through hydrogen-bonds to stabilize the
structure or they may interact to perform the catalytic function of a protein (enzyme). When
these clusters of interacting residues are separated by recombination (because some come
from one parent and others come from a different parent), this has a detrimental effect on the
protein, e.g. by destabilizing it, or making it non-functional. An objective of the invention
is to minimize and prevent schema disruption, e.g. by modeling the recombination of parent
fragments to preserve schema in the resulting mutants.

A "domain disruption" is the disruption of a compact structural domain or folding
unit of a biopolymer, e.g. a protein.

Schema disruption and domain disruption may also be profiled, in a manner akin to
crossover disruption profiles.






The "crossover probability", which is also denoted here by the symbol P_c, is the
probability that a crossover will occur between two given nucleic or amino acid sequences
(for example, between two homologous genes). Crossover probability is related to the
experimental average fragment size in recombination experiments, and is a parameter that
can be influenced or controlled in certain recombination protocols. For example, crossover
probability can be controlled in DNA shuffling according to the time that parental templates
are exposed to the DNA-cleaving DNAse. In StEP recombination, this is controlled by
timing the annealing/extension cycles. The relationship between fragment size f and
crossover probability can be expressed as: P_c = (f - 1)/N, where N is either the number of
amino acid residues (when calculating recombinant mutants based on a protein sequence),
or the number of nucleotides (when calculating the recombinants based on the DNA
sequence).

The terms "crossover location" and "cut-point" are synonymous. The term refers to
the location on a biopolymer sequence where recombination occurs. A cut point is a specific
position at which a polymer sequence is broken in recombination.

The term "crossover region" refers to the area surrounding the crossover location, for
example within a range of residues on either side of a cut point. In certain experiments and
recombination methods the precise location of a cut point is uncertain or cannot be
determined or experimentally resolved. For example, when two parents share sequence
identity, it may not be possible to determine from the sequence of the recombinant offspring
precisely where within an aligned or surrounding region the cut point (crossover) occurred.
The range of possible cut points, each of which could have produced the observed
recombination results, can be called the crossover region. Once a region of sequence identity
(a crossover region) has been identified, the specific placement of the cut point is not critical.

The term "fitness" is used to denote the level or degree to which a particular property
or combination of properties for a polymer (e.g. a biopolymer such as a protein or a nucleic
acid) is optimized. In directed evolution methods of the invention, the fitness of a polymer
is preferably determined by properties which are identified for improvement. For example,
the fitness of a protein may refer to the protein's stability (e.g. at different temperatures or
in different solvents), its biological activity or efficiency (e.g. catalytic function), its binding
affinity or selectivity (e.g. enantioselectivity), its solubility (e.g. in aqueous or organic
solvent), and the like.

Fitness can be determined or evaluated experimentally or theoretically, e.g.
computationally. Other examples of fitness properties include enantioselectivity, activity
towards non-natural substrates, and alternative catalytic mechanisms. Coupling interactions
can be modeled as a way of evaluating or predicting fitness.

Preferably, the fitness is quantitated so that each polymer (e.g., each amino acid or
nucleotide sequence) will have a particular "fitness value". For example, the fitness of a
protein may be the rate at which the polymer catalyzes a particular chemical reaction, or the
protein's binding affinity for a ligand. In another embodiment, the fitness of a polymer refers
to the conformational energy of the polymer and is calculated, e.g., using any method known
in the art.

Generally, the fitness of a polymer is quantitated so that the fitness value increases
as the property or combination of properties is optimized. For example, where the thermal
stability of a polymer is to be optimized (conformational energy is preferably decreased), the
fitness value may be the negative conformational energy; i.e., F = -E.

Such techniques are found in the following exemplary references: Brooks BR,
Bruccoleri RE, Olafson BD, States DJ, Swaminathan S & Karplus M, "CHARMM: A
Program for Macromolecular Energy, Minimization, and Dynamics Calculations," J. Comp.
Chem., 4: 187-217 (1983); Mayo SL, Olafson BD & Goddard WA III, "DREIDING: A
Generic Force Field for Molecular Simulations," J. Phys. Chem., 94: 8897-8909 (1990);
Pabo CO & Suchanek EG, "Computer-Aided Model-Building Strategies for Protein Design,"
Biochemistry, 25: 5987-5991 (1986); Lazar GA, Desjarlais JR & Handel TM, "De Novo
Design of the Hydrophobic Core of Ubiquitin," Protein Science, 6: 1167-1178 (1997); Lee
C & Levitt M, "Accurate Prediction of the Stability and Activity Effects of Site-Directed
Mutagenesis on a Protein Core," Nature, 352: 448-451 (1991); Colombo G & Merz KM,
"Stability and Activity of Mesophilic Subtilisin E and Its Thermophilic Homolog: Insights
from Molecular Dynamics Simulations," J. Am. Chem. Soc., 121: 6895-6903 (1999); Weiner
SJ, Kollman PA, Case DA, Singh UC, Ghio C, Alagona G, Profeta S Jr, Weiner P, "A new
force field for molecular mechanical simulation of nucleic acids and proteins," J. Am. Chem.
Soc., 106: 765-784 (1984).

The term "fitness landscape" is used to describe the set of all fitness values belonging
to all polymer sequences in a sequence space. Thus, for example, referring again to the
sequence space for proteins 300 amino acid residues in length (i.e., the group consisting of
all sequences of 300 amino acid residues), each polypeptide in the sequence space will have
a particular fitness value that may (at least in theory) be calculated or measured (e.g., by
screening each polypeptide to determine its fitness). The set of these fitness values is
therefore the fitness landscape of the sequence space for proteins 300 amino acid residues
in length. In many embodiments fitness values may vary considerably among individual
sequences in a given sequence space. The fitness value for a given sequence may be higher
or lower than other, similar sequences in the sequence space. These fitness values are
therefore referred to as "local maxima" (or "local optima") and "local minima", respectively.
Such a fitness landscape is described as "rugged" when it contains many local maxima
and/or local minima in the fitness values. In all representations of the fitness landscape,
there is a "global optimum," representing the sequence with the highest fitness. If the highest
fitness is degenerate (multiple sequences have the same fitness), then more than a single
sequence can be the global optimum. An objective of directed evolution and computational
design methods is to generate sequences having fitness values greater than the fitness
value(s) of the starting (e.g. parent) sequence or sequences. In a preferred embodiment of
the invention, the directed evolution and computational design methods generate sequences
having fitness values as close to the global optimum as is possible.
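
The notions of local maxima and global optimum can be made concrete on a toy sequence space small enough to enumerate. The sketch below assumes a two-letter alphabet, sequences of length 3, and an arbitrary, hypothetical table of fitness values; "similar sequences" are taken here to mean sequences differing at exactly one residue, which is an assumption made for illustration.

```python
from itertools import product


def fitness_landscape(alphabet, length, fitness):
    """Map every sequence in the sequence space to its fitness value."""
    return {"".join(s): fitness("".join(s)) for s in product(alphabet, repeat=length)}


def neighbors(seq, alphabet):
    """Yield all sequences that differ from seq at exactly one residue."""
    for pos, old in enumerate(seq):
        for new in alphabet:
            if new != old:
                yield seq[:pos] + new + seq[pos + 1:]


def local_maxima(landscape, alphabet):
    """Sequences whose fitness is strictly higher than that of every one-residue neighbor."""
    return sorted(
        s for s, f in landscape.items()
        if all(f > landscape[n] for n in neighbors(s, alphabet))
    )


def global_optima(landscape):
    """All sequences sharing the highest fitness (more than one if degenerate)."""
    best = max(landscape.values())
    return sorted(s for s, f in landscape.items() if f == best)


# Hypothetical fitness values for the 2**3 = 8 sequences over the alphabet "AB".
toy_fitness = {"AAA": 3, "AAB": 1, "ABA": 1, "BAA": 1,
               "ABB": 0, "BAB": 0, "BBA": 0, "BBB": 2}
landscape = fitness_landscape("AB", 3, toy_fitness.__getitem__)
maxima = local_maxima(landscape, "AB")
optima = global_optima(landscape)
```

In this toy landscape "AAA" and "BBB" are both local maxima, but only "AAA" is the global optimum. Exhaustive enumeration of this kind is feasible only for very small sequence spaces and is shown solely to illustrate the definitions.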

The "fitness contribution" of a polymer residue refers to the level or extent f(i_a) to
which the residue i_a, having an identity a, contributes to the total fitness of the polymer.
Thus, for example, if changing or mutating a particular polymer residue will greatly decrease
the polymer's fitness, that residue is said to have a high fitness contribution to the polymer.
By contrast, typically some residues i_a in a polymer may have a variety of possible identities
a without affecting the polymer's fitness. Such residues, therefore, have a low contribution
to the polymer fitness.

5.2 General Methods

In accordance with the invention, there may be employed conventional molecular
biology, microbiology and recombinant DNA techniques within the skill of the art. Such
techniques are explained fully in the literature. See, for example, Sambrook, Fritsch &
Maniatis, Molecular Cloning: A Laboratory Manual, Second Edition (1989) Cold Spring
Harbor Laboratory Press, Cold Spring Harbor, New York (referred to herein as "Sambrook
et al., 1989"); DNA Cloning: A Practical Approach, Volumes I and II (D.N. Glover ed.
1985); Oligonucleotide Synthesis (M.J. Gait ed. 1984); Nucleic Acid Hybridization (B.D.
Hames & S.J. Higgins, eds. 1984); Animal Cell Culture (R.I. Freshney, ed. 1986);
Immobilized Cells and Enzymes (IRL Press, 1986); B.E. Perbal, A Practical Guide to
Molecular Cloning (1984); F.M. Ausubel et al. (eds.), Current Protocols in Molecular
Biology, John Wiley & Sons, Inc. (1994).

The invention pertains to a computational method for identifying cut points or
locations in proteins that will permit crossovers in in vitro recombination experiments, while
retaining structural stability (and consequently, desirable properties) in the offspring hybrid
proteins. The invention can be applied to protein sequences of any or no sequence similarity.
Sequence and tertiary structural information at the protein level for at least one of the starting
parental sequences is used to identify structural domains or coupled residues and calculate
their disruption.

5.3 Overview of Modeling Techniques

Disruption Profiles. According to the invention, recombination modeling
calculations are applied to determine the disruption of a biopolymer fragment (e.g. a schema
disruption profile for a domain or protein) relative to the original
structure. In other words, whether structural changes produced by recombination are compatible
with a parent or other functional, "native" or "reference" structure. In this way recombinations
(arid recombinants) that arc predicted to disrupt schema (or coupling interactions) can be 

5 eliminated in favor of a smaller librarv' of recombinants predicted to preserve them. This 
library is EttDte likeiy to contain offspring which retain essential and/or beiieBciai 
propeitie$(s«ch as activity and stability) and can be searched for other or improved properties 
relative to their parents. The techniques for determining disruption profilea include: (a) 
calculation of croasover disruption, e.g. using distance-based or energy based criteria for 

10 coupling: (b) calculation of domains in the protein structure; and (c) calculating the 
disruption (e.g. a disruption profile) based on a crossover disruption, domaiti disruption, or 
both. A schema disruption based on a combination of the domain and crossover disruption 
is preferred. Distance-based criteria for ci-ossover disruption of coupling is also preferred. 
Recombination Modeling. Calculations are made to model possible parental
fragments for recombination based on: (a) a requirement of sequence identity between
parents (for sequence-identity-dependent experimental protocols, such as DNA shuffling);
(b) a constraint on the number and location of crossovers (for example, the ITCHY protocol
allows a single cut point, which very substantially reduces the number of possible fragments);
(c) other specified constraints, e.g., exon shuffling; and/or (d) a protocol without constraints
(used to determine the optimal crossover).

5.4 Schema Disruption Model

Interactions among residues of a biopolymer can be modeled as schema, which in
turn can be evaluated (e.g. in a schema disruption profile) to determine optimum crossover
locations for recombining two or more parent molecules. Schema can be based on coupling
interactions between residues, e.g. based on conformational energy and/or interatomic
distances. According to the invention, crossover locations that do not disrupt coupling
interactions or schema are preferred.






Principles of crossover disruption of coupling interactions according to the invention
are illustrated in FIG. 2. A "Protein Z" having amino acid residues (shown as circles) at
positions 1 through 12 is shown in cartoon form. In part A of FIG. 2, Protein Z is shown in
a folded cartoon at the left, and in a two-dimensional representation of its folded
three-dimensional conformation at the right. These drawings indicate the relative location or
position in space of each residue with respect to the other residues. The black line represents
peptide bonds between the residues 1-12. The grey dotted lines represent coupling
interactions between amino acid side chains. For example, residue 3 is joined to residues
2 and 4 by peptide bonds (solid lines). Residue 3 is coupled to residues [...] and 11 by coupling
interactions (dotted lines), which may be associated with any molecular forces other than the
peptide bonds of the protein's primary structure.

The coupling interactions can be mapped to a coupling matrix, as shown for example
in part B of FIG. 2. In this view of the matrix, the primary amino acid sequence 1-12 is
shown in linear form, with each superimposed line indicating a coupling interaction. The
number of interactions affecting each residue is conveniently shown. These lines also show
which residues are coupled to each other.
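
As an illustration only, the bookkeeping behind such a diagram (the number of interactions
affecting each residue, and the partners of a given residue) might be sketched as follows.
The coupling pairs are hypothetical and are not the pairs drawn in FIG. 2; the function
names are likewise assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical coupling pairs for a 12-residue toy protein (1-based residue
# numbers). These are made up for illustration and are NOT taken from FIG. 2.
COUPLINGS = [(1, 8), (2, 7), (3, 11), (3, 12), (4, 6), (5, 9)]


def interactions_per_residue(couplings, n_residues):
    """Return a list whose entry r-1 is the number of couplings touching residue r."""
    counts = Counter()
    for i, j in couplings:
        counts[i] += 1
        counts[j] += 1
    return [counts[r] for r in range(1, n_residues + 1)]


def partners(couplings, residue):
    """Return the sorted list of residues coupled to the given residue."""
    found = set()
    for i, j in couplings:
        if i == residue:
            found.add(j)
        elif j == residue:
            found.add(i)
    return sorted(found)
```

Calling `partners(COUPLINGS, 3)` lists the residues joined to residue 3 by coupling
interactions in this toy data set.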

According to the invention it is desirable for recombination to minimize disruption
of coupling interactions. This can be achieved, for example, by cutting the sequence for
recombination at locations selected so that the least number of interactions are separated onto
different fragments. Desirable or optimum cut points can be identified with the aid of a
crossover interaction profile, or of a crossover disruption (E_c), as shown graphically in part
C of FIG. 2. The graph shows the crossover disruption E_c, or the number of coupling
interactions that are broken (y-axis), for each residue of the protein (x-axis), when a single
cut is located before each residue. (A cut point can be named for the residue it follows or
precedes. In this example, each cut point occurs before the named residue.) For example,
if Protein Z is cut at residue 3 in a recombination experiment, i.e. the cut is between residues
2 and 3, the resulting fragments for recombination (e.g. from two different parent proteins)
are a fragment 1-2 and a fragment 3-12. (Note that in the example recombination actually
occurs at the genetic level, at a corresponding cut point in a nucleotide sequence that encodes
Protein Z.) The graph C of FIG. 2, like the diagram B, shows that for this hypothetical protein
Z, a cut point at residue 3 will disrupt seven coupling interactions. Crossover disruption can
be calculated by computer, using known programming methods.

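
A minimal sketch of such a calculation is given below, assuming that coupling interactions
are supplied as pairs of 1-based residue numbers and that a cut "at residue k" falls between
residues k-1 and k, as in the example above. The pairs used are hypothetical and do not
reproduce FIG. 2, so the values differ from those in the figure.

```python
# Hypothetical coupling pairs for a 12-residue toy protein (NOT those of FIG. 2).
TOY_COUPLINGS = [(1, 8), (2, 7), (3, 11), (4, 6), (5, 9)]


def single_cut_profile(couplings, n_residues):
    """Count the couplings broken by a single cut placed before each residue.

    A cut at residue k separates residues 1..k-1 from residues k..n, so a
    coupling (i, j) is broken when min(i, j) < k <= max(i, j). Entry k-1 of
    the returned list is the count for a cut at residue k; the entry for
    residue 1 is always 0 because no residue precedes it.
    """
    profile = []
    for k in range(1, n_residues + 1):
        broken = sum(1 for i, j in couplings if min(i, j) < k <= max(i, j))
        profile.append(broken)
    return profile


def least_disruptive_cut(profile):
    """Return the residue number (2..n) whose preceding cut breaks the fewest couplings."""
    candidates = range(2, len(profile) + 1)
    return min(candidates, key=lambda k: profile[k - 1])
```

Ties are resolved in favor of the lowest residue number, which is an arbitrary choice made
for this sketch.
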
For the simple structure of Protein Z, the graph shows that the crossover disruption
is greatest if a cut is made in the center of the gene, e.g. at a nucleotide triplet or codon
corresponding to one of amino acid residues [...]. According to the invention, cut points
are selected to minimize crossover disruption [...]. For example, a cut point at residue
11 (e.g. parent A donates residues 1-10 and parent B donates residues 11-12) will produce
mutants having less crossover disruption than a cut point at residue 6 (parent A donates
residues 1-5 and parent B donates residues 6-12). Mutants with less crossover disruption are
more likely to be functional and retain desirable properties from one or both parents. When
such parents are used in directed evolution experiments, the probability of adding or
improving desirable properties, without loss of stability, utility or [...]

In this way the methods of the invention are not random. Cut points for recombination are
not obtained only as a random consequence of known directed evolution methods, such as
error-prone PCR, family shuffling or StEP. Rather, more favorable or promising cut points
are identified and preselected according to an evaluation of coupling interactions and
crossover disruption, as illustrated in the coupling matrix of FIG. 2.

The invention is not limited to the use of a single cut point. More than one cut point
may be used to provide a plurality of fragments from two or more parents for
recombination. For example, two cut points can be selected for hypothetical Protein Z,
indicated by scissor icons in part B of FIG. 2. When the residues between these cut points
come from parent A (residues 4-7) and the terminal fragments come from parent B (residues
1-3 and 8-12) the crossover disruption is reduced to zero. According to the invention, these
cut points and the resulting parental fragments would be preferred for recombination
experiments, e.g. where mutants obtained from such recombinations are screened for
desirable properties, including new or modified properties, or the loss or reduction of one or
more undesirable properties.
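
The effect of using more than one cut point can be sketched as follows. This is an
illustration under stated assumptions: fragments alternate between two parents, a cut
named for residue k falls before residue k, and the coupling pairs are hypothetical ones
chosen so that cuts at residues 4 and 8 break none of them (they are not the pairs of FIG. 2).

```python
# Hypothetical couplings: two inside residues 4-7, three linking the terminal
# fragments 1-3 and 8-12. Chosen for illustration only.
PAIRS = [(4, 7), (5, 6), (1, 12), (2, 9), (3, 8)]


def assign_parents(n_residues, cut_points, parents=("A", "B")):
    """Label residues 1..n with the parent that donates them.

    cut_points lists the residues before which a cut falls. Fragments
    alternate between the given parents, starting with parents[0].
    """
    cuts = set(cut_points)
    labels = []
    current = 0
    for residue in range(1, n_residues + 1):
        if residue in cuts and residue != 1:
            current = (current + 1) % len(parents)
        labels.append(parents[current])
    return labels


def crossover_disruption(couplings, labels):
    """Count couplings whose two residues come from different parents."""
    return sum(1 for i, j in couplings if labels[i - 1] != labels[j - 1])
```

With `assign_parents(12, [4, 8], ("B", "A"))`, residues 4-7 come from parent A and the
terminal fragments from parent B, mirroring the arrangement described above.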

Calculation of Coupling Interactions From A Crystal Structure 

As shown by FIG. 1B, a structure file of a parent polymer is obtained, such as a data
file representing the three-dimensional structure of a gene or a protein. Databases of this
kind are known in the art. Coupling interactions between the building blocks of the polymer
are then identified from the structural data [...]. From the identified coupling
interactions, structural domains [...] can be compiled and represented as a schema for the
polymer. For example, when the polymer is a protein, and the schema building blocks are
amino acid residues, the set of residues contributing to each domain of the three-dimensional
protein structure can be determined. Because a protein is folded, the residues which interact
and participate in a domain may be, and often are, not adjacent to each other in the linear or
primary sequence of the protein. This is shown, for example, in the cartoon for "Protein Z"
in FIG. 2A, where amino acid residues 1 and 8 are close to each other in the
three-dimensional structure.

Domains, for example folding domains, can be identified by testing for residues
which interfere with structural stability, and which form groups of residues that are
considered essential or important to stability, based on threshold criteria as described herein
(e.g. conformational energy or atomic distance thresholds). Groups of residues which, if
altered, would significantly impair structural stability are identified as domains. Crossover
disruptions can be calculated for the residues, using the methods described herein, to identify
domains and generate a schema profile. See e.g., the accompanying Examples, and especially
Example 6.3, for domain identification, and schema and crossover disruption based on
distance criteria.

Once the domains and sets of interacting building blocks are identified and a schema
[...] for each domain of the polymer [...] plotted as a schema disruption profile as described
herein and in a manner similar to a crossover disruption profile. To determine the crossover
disruption and generate the profile, a threshold disruption value is set. The contribution of
each residue of each domain to the structural integrity or fitness of the polymer is evaluated,
based on the degree to which it interacts with each other residue of each other domain. This
is compared to the threshold crossover disruption, which is determined empirically or is
modeled as a probability as described above for E_c in a DNA shuffling recombination
context. Domains which exhibit a low crossover disruption compared to the threshold are
"rejected", meaning they can be substituted without disrupting the structure. Domains which
exhibit a high crossover disruption are "accepted", meaning that they are schema which
should be preserved in the offspring. This follows from the principles described above.
Domains which are essential or important to the structural integrity or shape of the polymer
(which have a high crossover disruption) should not be disrupted by recombination, in favor
of crossovers in domains that are less essential or important to the structural integrity or
shape of the polymer (they have a low crossover disruption). It should be noted, however,
that the terms "accept" and "reject" (FIG. 1B) are relative, and could be interchanged
depending on the desired point of view. Thus, domains with a low crossover disruption could
be "accepted" as candidates for crossover recombination. Domains with a high crossover
disruption would be "rejected" for crossover recombination, so that those domains can be
protected or preserved.

The process of accepting and rejecting domains to generate a schema disruption
profile can be performed iteratively, until all residues of all domains are identified and their
relative contribution to the structure of the polymer is determined. When this is "Done"
(FIG. 1B), the data is used to mark all domains that are disruptive, so that they will be
preserved; crossover recombinations in these domains will not be modeled or performed.
From the remaining domains, optimal crossovers can be identified. These are the sets of
possible crossovers within the low disruption domains that are calculated to perturb the
polymer the least, while offering the best chances for new or improved properties.
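
By way of illustration, the marking step might be sketched as below. The domain names,
disruption values and threshold are hypothetical, and treating a value equal to the threshold
as "high" is an assumption of this sketch rather than something stated in the text.

```python
def classify_domains(domain_disruption, threshold):
    """Split domains into those to preserve and those open to crossovers.

    domain_disruption maps a domain name to its crossover disruption value.
    Domains at or above the threshold are marked for preservation; the rest
    remain candidates for crossover recombination.
    """
    preserve, crossover_candidates = [], []
    for name in sorted(domain_disruption):
        if domain_disruption[name] >= threshold:
            preserve.append(name)
        else:
            crossover_candidates.append(name)
    return preserve, crossover_candidates


# Hypothetical example values.
EXAMPLE_DOMAINS = {"D1": 12, "D2": 3, "D3": 7, "D4": 1}
```

Optimal crossovers would then be sought only within the domains returned in the second list.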




The last two steps of FIG. 1B are optional. If a recombination protocol is to be used
for directed evolution experiments, the protocol may have restrictions on the crossover
locations which are accessible to the method, or the number and manner in which crossovers
occur. Using a cut point or fragment file which identifies and represents these restrictions,
the sequence space of optimal crossovers from the previous steps can be further limited or
reduced to those which also satisfy the restrictions of the experimental protocol. For
[...] homologous recombination, sequence identity or alignments [...] of FIG. 1A and
FIG. 12 may be used in combination with the [...] described with reference to FIG. 1B.

Conceptually, a set of possible parents is selected based on structural similarity. In
one embodiment, the parents can be identified based on regions of sequence identity. Using
the computational methods described herein, a set of all possible cut points for these parents
can be generated. These computations are independent of any constraints on recombination,
for example limitations which may be posed by particular protocols for directed evolution.
The set of optimum cut points can then be determined from the set of all possible cut points,
using the methods of the invention. More particularly, cut points are selected to minimize
the disruption of coupling interactions in the three-dimensional structure of the protein.
Recombination or evolution methods can then be selected and adapted to cut and recombine
the parents at the selected cut points.

In preferred methods of the invention, once the parental sequences are aligned and
candidate cut points identified, the structure or conformation of one of the parent sequences
is also obtained or otherwise provided (FIG. 1A). The preferred method of the invention
requires the structure or conformation of a parental amino acid sequence be obtained or
otherwise provided. In many preferred embodiments, and particularly in embodiments where
the parent sequence is the sequence for a known protein or nucleic acid, the structure or
conformation of the parent sequence will be known and can be obtained from any of a variety
of resources (see [...]). For example, and not by way of limitation, the Protein Data Bank
(PDB) (Berman et al., Nucl. Acids Res. 2000, 28:235-242) is a public repository of
three-dimensional structures for a large number of macromolecules, including the structures
of many proteins, nucleic acids and other biopolymers.

Alternatively, in many embodiments the structure of a polymer (e.g., protein)
sequence that is similar or homologous to the parent sequence will be known. In such
instances, it is expected that the conformation of the parent sequence will be similar to the
known structure of the homologous polymer. The known structure may, therefore, be used
as the structure for the parent sequence or, more preferably, may be used to predict the
structure of the parent sequence (i.e., in "homology modeling"). As a particular example, the
Molecular Modeling Database (MMDB) (see, Wang et al., Nucl. Acids Res. 2000,
28:243-245; Marchler-Bauer et al., Nucl. Acids Res. 1999, 27:240-243) provides search
engines that may be used to identify proteins and/or nucleic acids that are similar or
homologous to a parent sequence (referred to as "neighboring" sequences in the MMDB),
including neighboring sequences whose three-dimensional structures are known. The
database further provides links to the known structures along with alignment and
visualization tools whereby the homologous and parent sequences may be compared and a
structure may be obtained for the parent sequence based on such sequence alignments and
known structures.

In other embodiments, where the structure for a particular parent sequence may not
be known or available, it is typically possible to determine the structure using routine
experimental techniques (for example, X-ray crystallography and Nuclear Magnetic
Resonance (NMR) spectroscopy) and without undue experimentation. See, e.g., NMR of
Macromolecules: A Practical Approach, G.C.K. Roberts, Ed., Oxford University Press Inc.,
New York (1993). Alternatively, and in less preferable embodiments, the three-dimensional
structure of a parent sequence may be calculated from the sequence itself and using ab initio
molecular modeling techniques already known in the art. Three-dimensional structures
obtained from ab initio modeling are typically less reliable than structures obtained using
empirical (e.g., NMR spectroscopy or X-ray crystallography) or semi-empirical (e.g.,
homology modeling) techniques. However, such structures will generally be of sufficient
quality, although less preferred, for use in the methods of this invention.

Calculation of a Schema Disruption Profile
Once the three-dimensional amino acid structure of one of the parental sequences is
determined, the method of the invention provides for the determination of coupling
interactions between [...] amino acid side chains. In a preferred embodiment the
[...] in a binary fashion. For example, if residues 3 and 8 of a structure are the only
coupled residues, then the (3,8) and (8,3) members or cells of the NxN matrix can be set to 1,
and all other cells are set to 0.
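
The binary matrix of this example can be sketched in a few lines. The value N = 12 used in
the usage note is an assumption made for illustration; the text does not state the size of
the structure.

```python
def coupling_matrix(n_residues, couplings):
    """Build a symmetric N x N matrix of 0/1 entries from 1-based coupling pairs."""
    matrix = [[0] * n_residues for _ in range(n_residues)]
    for i, j in couplings:
        matrix[i - 1][j - 1] = 1  # cell (i, j)
        matrix[j - 1][i - 1] = 1  # cell (j, i), so the matrix stays symmetric
    return matrix
```

For instance, `coupling_matrix(12, [(3, 8)])` sets only the (3,8) and (8,3) cells to 1.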

The coupling interactions can be defined by the determination of conformational
energy between residues, or based on distance parameters such as interatomic distances (the
distances between atoms in residues of the polymer). Calculations based on distances are
preferred. An energy or distance measure that is outside a certain threshold between residues
can be used to determine that the residues are considered to be uncoupled. For example,
in embodiments based on conformational energy or distance, only those residues that
exhibited a stabilization or conformational energy below a defined threshold, or within a
threshold interaction distance, are considered to be coupled. For example, in a preferred
conformational energy embodiment, the threshold was defined as 0.25 kcal/mol.
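
A distance-based criterion of this kind might be sketched as follows. The 4.5 angstrom
cutoff, the exclusion of sequence-adjacent residues, and the coordinate layout are all
assumptions made for this illustration; the text above does not specify a distance threshold.

```python
import math
from itertools import combinations


def coupled_pairs(atoms_by_residue, cutoff=4.5, min_separation=2):
    """Return residue pairs whose closest atoms lie within the cutoff distance.

    atoms_by_residue maps a residue number to a list of (x, y, z) atom
    coordinates. Pairs closer than min_separation in sequence are skipped so
    that residues joined by a peptide bond are not counted (an assumption).
    """
    pairs = []
    for i, j in combinations(sorted(atoms_by_residue), 2):
        if j - i < min_separation:
            continue
        closest = min(
            math.dist(a, b)
            for a in atoms_by_residue[i]
            for b in atoms_by_residue[j]
        )
        if closest <= cutoff:
            pairs.append((i, j))
    return pairs


# Hypothetical one-atom-per-residue hairpin, coordinates in angstroms.
HAIRPIN = {
    1: [(0.0, 0.0, 0.0)], 2: [(3.8, 0.0, 0.0)], 3: [(7.6, 0.0, 0.0)],
    4: [(7.6, 3.8, 0.0)], 5: [(3.8, 3.8, 0.0)], 6: [(0.0, 3.8, 0.0)],
}
```

The resulting pairs can be fed directly into a coupling matrix or a disruption count.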

5.5 Modeling Recombination Based on Fragment Restrictions

According to the invention, recombination protocols that limit or restrict the
fragments which can recombine can be modeled, and optimal crossovers from a set or subset
of fragments can be determined.






Sequence Identity Based Recombination

In this example, the invention is used to model the recombination of DNA sequences
using methods that rely or depend on sequence identity. FIG. 1A provides a flow diagram
illustrating a general, exemplary embodiment of the methods used in this invention. A
skilled artisan can readily appreciate that certain steps may be omitted and the order of the
steps may be changed. In particular, the flow diagram in FIG. 1A as well as other examples
presented in Section 6 describe preferred embodiments where the methods were used
in directed evolution of a protein or other polypeptide. Those skilled in the art can readily
appreciate, however, that the methods illustrated by these examples and throughout this
specification may be used to modify any polymer or biopolymer, including any amino acid
or nucleotide sequence, or any DNA or RNA molecule.

Parent Sequences.

The method shown in FIG. 1A begins with the selection of "parent" polymer
sequences. For example, the parent sequences may be any amino acid sequence and may or
may not correspond to a naturally occurring polypeptide. Each protein sequence is preferably
associated with a nucleic acid sequence (e.g., a gene encoding the protein). A preferred
embodiment utilizes homologous amino acid sequences. Another preferred embodiment
utilizes non-homologous amino acid sequences. Preferably, the parent sequence is also the
sequence for a protein that has some level or degree of activity or function (e.g., catalytic
activity, binding affinity, solubility, thermal stability, etc.) to be optimized. The methods of
the invention may then be used, e.g., to optimize the activity or function of the parent
sequence and/or to optimize the activity in altered conditions. For example, in one
embodiment the parent sequence may be a protein having a particular catalytic or other
activity, and the methods of the invention may be used to identify sequences having the
same activity but under different (generally more extreme) conditions such as conditions of
temperature or of solvent (including, for example, solvent polarity, salt conditions, acidity,
alkalinity, etc.). In another embodiment, the parent sequence may have a particular level or
amount of activity (e.g., catalytic activity, binding affinity, etc.), and the directed evolution
methods of the invention may be used to identify sequences having improved levels or
amounts of that activity (e.g., higher binding affinity or increased catalytic rate).

Align Polymer Sequences. 

Once the parental sequences are selected, the sequences are aligned (FIG. 1A). The
invention contemplates alignment of parental sequences in either nucleic acid or amino acid
forms. In a preferred embodiment, homologous (evolutionarily related) amino acid parental
sequences are aligned based upon sequence identity, sequence similarity, or a combination
of both parameters. The various parameters associated with alignment of amino acid
sequences are well known in the art. In another preferred embodiment, the parental sequences
are aligned as nucleic acid sequences. In a preferred embodiment, the nucleic acid sequences
are aligned based upon regions of sequence identity.

Alignment of parental sequences can be accomplished visually or with the use of an
algorithm. The invention encompasses the use of, but is not limited to, the following
alignment programs: GAP, BLAST, FASTA, DNA Strider, CLUSTAL and GCG. The
invention includes the use of default parameters and standard parameters of the computer
programs. It preferably includes the use of alignment parameters routinely employed in the
art. A preferred embodiment of the invention utilizes the BLAST amino acid alignment
program to align homologous sequences. Each parent sequence is aligned with the structure
sequence using a BLAST algorithm for comparing two sequences. Tatusova, T. A. & Madden
T. L., FEMS Microbiol. Lett. 174:247-250 (1999). The BLOSUM62 matrix is used to score
similar amino acids and the open gap and extension gap penalties are 11 and 1, respectively.

Determination of Possible Crossover Locations Based on Hybridization

The invention encompasses a computational "in silico" simulation of in vitro and in
vivo recombination. The types of in vitro recombination that are simulated include, but are
not limited to, various forms of recombination methods such as DNA shuffling, StEP,
random-priming recombination, and DNAse restriction enzymes.

Crossover locations for recombination can be determined based on hybridization
between parents. When parental sequences contain areas of sequence identity, aligned
sequences can be examined for areas of identity based upon a predetermined subset or
number of sequential identical amino acids or nucleotides in two aligned parental sequences
(FIG. 1A). A preferred number of sequential identical amino acids is related to the required
length of the DNA for hybridization to occur in a particular recombination experiment. A
preferred embodiment is to search for regions of four identical amino acids, or six identical
nucleotides shared by the parents. After identification of the areas that meet the selected
parameters of sequence identity, a cut point in the identified area of sequence identity on the
parental sequence is selected as a crossover location. The placement of the cut point within
a crossover region is not critical. As one example, the cut point may be selected at any
location within the identified region of sequence identity.
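
The search for identity regions can be sketched as below, assuming two pre-aligned sequences
of equal length with "-" as the gap character. The example sequences are hypothetical, and
placing the cut point at the middle of each region is one arbitrary choice among the
locations the text permits.

```python
def identity_regions(seq_a, seq_b, min_run=4, gap="-"):
    """Find runs of at least min_run identical aligned positions.

    Returns (start, end) pairs with 0-based start and exclusive end.
    Positions where both sequences carry a gap are not counted as identical.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    regions = []
    start = None
    for pos, (a, b) in enumerate(zip(seq_a, seq_b)):
        same = a == b and a != gap
        if same and start is None:
            start = pos
        elif not same and start is not None:
            if pos - start >= min_run:
                regions.append((start, pos))
            start = None
    if start is not None and len(seq_a) - start >= min_run:
        regions.append((start, len(seq_a)))
    return regions


def candidate_cut_points(regions):
    """Pick one cut point per region; the midpoint is used here for simplicity."""
    return [(start + end) // 2 for start, end in regions]


# Hypothetical aligned parent fragments (amino acids).
PARENT_1 = "MKTAYIAKQRQISF"
PARENT_2 = "MKTAHIGKQRQLSF"
```

For nucleotide alignments the same function can be called with `min_run=6`.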

In one particular embodiment of the invention, a computational algorithm was
utilized to mimic DNA shuffling recombination. Starting with the aligned parental DNA
sequences and their respective possible crossover locations (i.e., all possible cut points), a
randomly selected parental DNA sequence served as the initial template and was copied to
mutant offspring. When an identified candidate crossover location was reached in the
copying process the parental template was switched to a randomly selected different parental
template under specified conditions. In a preferred embodiment, the specified conditions
were set as follows: (1) a randomly chosen number between 0 and 1 was less than a threshold
value (e.g. 0.03) and (2) a minimum of eight amino acids between identified crossover
locations where crossovers actually occurred. The threshold value represents the average
number of fragments that each parent gene is cut into. For example, in DNA shuffling
experiments, this parameter is related to the time that the parent template DNA is exposed to
the DNA-cleaving enzyme DNAse. The expression to obtain the threshold value from the
average number of fragments f is: threshold = (f-1)/N, where N is the size of the gene. The
value 0.03 was set to model the fragment size reported by Stemmer, supra, for the
beta-lactamase shuffling experiment.

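
A minimal sketch of this template-switching procedure is given below. It is not the
program used in the embodiment; the function names are hypothetical, positions are 0-based,
and the minimum spacing is expressed in the same units as the sequence positions (an
assumption, since the text states the spacing in amino acids while the templates are DNA).

```python
import random


def threshold_from_fragments(f, n):
    """Threshold value (f - 1) / N for an average of f fragments in a gene of size N."""
    return (f - 1) / n


def shuffle_offspring(parents, crossover_sites, threshold=0.03, min_gap=8, rng=None):
    """Copy one offspring from aligned parents, switching template at random.

    parents: list of equal-length aligned sequences.
    crossover_sites: positions at which a template switch is permitted.
    A switch happens when a random number in [0, 1) is below the threshold
    and the previous switch lies at least min_gap positions back.
    Returns the offspring string and the list of positions where switches occurred.
    """
    if rng is None:
        rng = random.Random()
    sites = set(crossover_sites)
    template = rng.randrange(len(parents))
    offspring = []
    used = []
    for pos in range(len(parents[0])):
        if pos in sites and len(parents) > 1:
            far_enough = not used or pos - used[-1] >= min_gap
            if far_enough and rng.random() < threshold:
                others = [k for k in range(len(parents)) if k != template]
                template = rng.choice(others)
                used.append(pos)
        offspring.append(parents[template][pos])
    return "".join(offspring), used
```

Passing a seeded `random.Random` instance makes a simulated library reproducible.
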
Determining Crossover Disruption.

[...] parental sequences where recombination should be most successful due to minimal
disruption of tertiary amino acid interactions in a crossover mutant. A crossover disruption
E_c for each mutant is determined. In one embodiment of the invention, a coupling interaction
is considered disrupted if one of the amino acids of an interacting pair is replaced with an
amino acid from a different parent sequence in the hybrid mutant protein. The crossover
disruption for a particular mutant is determined by the summation of all coupled interactions
that are considered disrupted.

Selection of a crossover mutant with minimal crossover disruption.
Once the crossover disruption for a pool of mutant biopolymers is determined, a
threshold is applied to screen the mutant biopolymers for those mutants that exhibit minimal
amounts of crossover disruption. Non-limiting examples of selection parameters include the
following: (1) an application of a threshold, (2) selection of 10% of the mutant pool that
exhibited the least amount of crossover disruption, (3) selection of the 10 mutants that
exhibited the least amount of disruption, (4) selection of crossover mutants exhibiting a
crossover disruption below an average value, (5) selection of crossover mutants exhibiting
crossover disruption below a first standard deviation or more. In a preferred embodiment of
the method, a threshold is applied such that 1% of the total mutant pool is allowed by the
threshold. In another embodiment of the invention, a more stringent threshold is utilized,
whereby only 0.001% of the pool is allowed by the threshold.
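
Some of these selection parameters can be sketched as follows. The mutant identifiers and
disruption values are hypothetical, and breaking ties by identifier is an assumption made
so that the result is deterministic.

```python
import math


def select_least_disrupted(disruptions, fraction=None, count=None):
    """Keep the mutants with the lowest crossover disruption.

    disruptions maps a mutant identifier to its disruption value. Give either
    a fraction of the pool (e.g. 0.10 or 0.01) or an absolute count (e.g. 10).
    """
    ranked = sorted(disruptions, key=lambda m: (disruptions[m], m))
    if count is None:
        count = max(1, math.floor(len(ranked) * fraction))
    return ranked[:count]


def below_average(disruptions):
    """Keep the mutants whose disruption is below the pool average."""
    mean = sum(disruptions.values()) / len(disruptions)
    return sorted(m for m, e in disruptions.items() if e < mean)


# Hypothetical pool of 20 mutants with disruption values 0..19.
POOL = {f"m{i:02d}": i for i in range(20)}
```

A 1% or 0.001% threshold corresponds to calling `select_least_disrupted` with
`fraction=0.01` or `fraction=0.00001` on a correspondingly large pool.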

A variation of this method, as depicted in the flow chart of FIG. 1A, is shown
diagrammatically in FIG. 12.

Non-Sequence Identity Based Recombination 

Recombination that is not dependent on sequence identity can also be modeled
according to the invention. This can be called "non-homologous" recombination. Schema
based on structural features of parent polymers are identified, such as three-dimensional
domains of a protein, and accordingly, it is not necessary to align parent polymers in this
approach.






Other recombination methods limit the number of fragments and the locations for
crossovers between the parents. For example, the ITCHY protocol limits recombination to
one crossover point. Other known protocols use restriction enzymes to cut at very specific
locations in the gene, based on a stretch of DNA sequence 3-5 nucleotides long. If restriction
enzymes are used to fragment the parents, then crossovers occur based on the set of
restriction enzymes chosen by the researcher. For example, if a restriction enzyme is chosen
that only cuts at ATGG, then crossovers can only occur where ATGG appears in the parental
DNA sequence.
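
Locating such restricted crossover positions can be sketched as below. The DNA strings are
hypothetical, the exact position at which an enzyme cuts within its recognition site is not
modeled, and intersecting the positions across parents assumes that the parents share a
common coordinate system.

```python
def restriction_crossover_sites(dna, site="ATGG"):
    """Return the 0-based start index of every occurrence of the recognition site."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)  # step by one to allow overlapping hits
    return positions


def shared_sites(parent_dnas, site="ATGG"):
    """Return the positions at which every parent carries the recognition site."""
    common = None
    for dna in parent_dnas:
        found = set(restriction_crossover_sites(dna, site))
        common = found if common is None else common & found
    return sorted(common or [])


# Hypothetical parental DNA fragments.
PARENT_X = "CCATGGTTATGGA"
PARENT_Y = "CCATGGTTACGGA"
```

The positions returned can then be scored against the schema disruption to choose among enzymes.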

[...] restricted. The restrictions that result from these methods can be included in the
calculation and computation described here, for example by noting the potential crossover
points and either reconstructing possible chimeric mutants, as described infra, or by noting
the location of these crossover points with respect to the disruption of schema.

The schema disruption calculation provides a guide for both the restriction-enzyme-
based and exon-based recombination methods. From a starting database of exons or
restriction enzymes, a subset can be chosen that generates crossover locations that minimize
the schema disruption. This subset has a higher likelihood of generating chimeric mutants
that are structurally stable, thus generating libraries where improvement in the desired
properties is more likely.

5.6 Directed Evolution Methods to Target Optimal Crossover Locations

The methods described above are particularly useful for directed evolution
experiments, e.g., to obtain proteins, nucleic acids or other polymers having one or more
desirable properties. For example, the computational models and protein design algorithms
can be used with directed evolution techniques to target mutants or hybrids within a subset
of the total sequence space, and particularly within a sequence space corresponding to higher
fitness probabilities. Accordingly, the invention provides genetic engineering methods,
including methods of directed evolution, for obtaining polymers that have one or more
improved properties. The improved properties include any property or combination of
properties that can be detected by a user and include, for example, properties of catalytic
activity (for example, increased rates of catalysis), properties of stability (for example,
increased thermal stability) or properties of binding affinity (for example, increased affinity
for a particular ligand or increased affinity for a substrate) to name a few. Preferably the
desirable property is a property that can be detected in a screening assay.

Mutagenesis and Recombination

In general, directed evolution methods comprise selecting at least one polymer
sequence. The polymer sequence is preferably the sequence for a biopolymer (e.g., a nucleic
acid or a polypeptide) that has a particular property or properties of interest. For example,
the particular property of the parent may be a particular catalytic activity, binding to a
particular substrate or ligand, thermal stability or a combination thereof. Preferably the
property is one that can be readily determined or evaluated by a screening assay, e.g. a high
throughput screen. One or more residues of the parent polymer sequence is then selected or
targeted for mutation. In traditional methods for directed evolution, selection is random. For
example, all or a large fraction of the residues are available and/or are selected, e.g., by error
prone PCR or DNA shuffling. However, in the directed evolution methods of the invention,
specific residues in the parent sequence are identified as candidate crossover locations. The
crossover locations may be identified, for example, according to the analytical methods
described above.

One or more, and preferably a plurality of mutant polymer sequences may then be
generated based on the parent sequence. In particular, the directed evolution methods of the
invention preferably generate a plurality of mutants which are identical to the parent
sequence except [...]. Polymers having the mutant sequences may then be generated using
polymer synthesis and/or recombinant technologies well known in the art, and the polymers
having these mutant sequences are then preferably screened for the one or more properties of
interest. In particular, methods of directed evolution typically have, as their goal, the
selection and/or identification of polymers (in particular, modified polymers) wherein one or
more particular properties of interest are altered, and are preferably improved. For example,
a directed evolution method may have, as its goal, the selection of polymers that have
improved catalytic activity (e.g., a higher rate of catalysis), improved (e.g., stronger) binding
to a particular ligand or substrate, or greater thermal stability. Therefore, in preferred
embodiments one or more of the mutant polymers are selected where one or more properties
of interest are different from the parent sequence. Preferably, the one or more properties of
interest are improved in the selected polymer sequences.

In preferred embodiments, methods of directed evolution may be repeated to generate
and identify polymers where one or more properties of interest progressively improve with
each iteration. Accordingly, in a preferred embodiment, one or more of the selected
polymers may be selected as a new parent sequence, for use in a next round of iteration in
the directed evolution method. Crossover locations in the new parent sequence may then be
identified and selected, and a second generation of mutants can be generated and screened
as described above. Improved mutants may also be recombined if desired, using
conventional genetic engineering techniques, to obtain further variations and improvements.
These processes may be repeated as desired, to obtain successive generations of mutants.

Polymer Evolution Techniques 

Methods for the directed evolution of polymers such as nucleic acids and
polypeptides are well known in the art. See, for example, Dube et al., Gene, 137:41 (1993);
Moore & Arnold, Nature Biotechnology 14 [pages illegible in source]; Joo et al., Nature
399:670 (1999); [further citations illegible in source, including one in J. Molecular
Evolution]. See, also U.S. Patent Nos. 5,741,691 and 5,811,238; International Patent
Applications WO 98/42832, WO 95/22625, WO 97/20078, WO 95/41653, and U.S. Patent
Nos. 5,605,793 and 5,830,721. Generally, such methods work by selecting a parent sequence,
typically a particular protein, and generating large numbers of mutants, for example by error
prone PCR of a gene encoding the selected protein. The mutants are then tested, preferably
in a screening assay, to identify mutants that actually have an improved property detected in
the assay (for example, increased catalytic activity, or stronger binding to a ligand or
substrate). These mutants are selected and again mutated, and the second generation of
mutants is again tested to identify new mutants where the property is further improved. Thus,
traditional directed evolution methods randomly search through the sequence space of a
polymer one residue at a time to identify mutants with an increased fitness.

Such traditional methods are limited, however, by the finite capacity of existing
assays to screen mutants. Existing screening assays may observe and/or select from between
about 10^[exponent illegible in source] or 10^[exponent illegible in source] mutants, depending
on the particular method. However, for a typical protein of 300 amino acid residues the
number of possible amino acid combinations is about 10^390 (that is, 20^300). Thus, screening
assays can only observe a small fraction of sequences in the sequence space of a given parent.
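
The size of this gap can be checked with a few lines of arithmetic. The following is a
minimal Python sketch and is not part of the original disclosure; the function name and the
screening capacity of 10^6 used in the usage lines are illustrative assumptions only.

```python
import math


def log10_sequence_space(length, alphabet_size=20):
    """Return log10 of the number of sequences of the given length.

    Working in log10 keeps the result readable; alphabet_size=20 assumes
    the twenty standard amino acids.
    """
    return length * math.log10(alphabet_size)


if __name__ == "__main__":
    log10_space = log10_sequence_space(300)
    # Hypothetical screening capacity (an assumption, not a figure from the text).
    log10_screened = 6
    print(f"sequence space is about 10^{log10_space:.0f}")
    print(f"fraction screened is about 10^{log10_screened - log10_space:.0f}")
```

Whatever capacity is assumed for the assay, the screened fraction stays vanishingly small,
which is the motivation for restricting crossovers to computed locations.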

Using the analytical methods described above, a user can improve upon such existing
methods by identifying locations on polymers that allow crossovers to occur while
maintaining their function and specifically selecting those locations for mutation in the
iterative step of a directed evolution experiment. In preferred embodiments a user may
identify and target residues that have crossover locations that exhibit crossover disruption
below a certain value in in vitro experiments.

The invention encompasses, but is not limited to, the following examples of in vitro
techniques: (1) fragmentation and reassembly techniques (e.g., the Stemmer DNA shuffling
method, Stemmer, Nature 1994, 370:389); (2) staggered extension process (StEP) (Zhao et
al., Nature Biotechnology [citation details illegible in source]); (3) synthesis techniques; and
(4) PCR based techniques. It will be understood by practitioners that these and other methods
can be used in the invention, and that these methods may be applied to any number of parents
and cut points. The recombination techniques of the invention include in vitro and in vivo
recombination, as well as methods which combine both approaches, and further,
recombinants can be cloned and/or expressed by host cells according to known techniques.
Fragmentation and reassembly techniques utilize a restriction enzyme or set of
restriction enzymes at specific concentrations to selectively cut biopolymer strands at
identified locations. The choice and concentration of enzyme(s) are determined based upon
the identified optimal crossover locations determined by the method of the invention. The
method can be applied to homologous and non-homologous nucleic acid sequences. The
resulting DNA fragments, produced by the restriction enzyme digest, can be reassembled by
techniques known in the art, thereby creating hybrid parental DNA strands that can be used
as templates for the production of proteins. The invention also encompasses the
fragmentation and reassembly of amino acid sequences. The fragmentation and reassembly
may be accomplished, for example, by the use of chemical methods or enzymes for
homologous or non-homologous amino acid sequences.

Alternatively, the StEP method (Zhao et al., Nature Biotechnology 1997, 49:290)
biases the creation of mutant hybrid proteins towards mutations at desired crossover
locations. A set of DNA primers is synthesized to hybridize with equal probability to all
parental strands at desired crossover recombination locations. The desired hybrid DNA
sequence can be created by chemically synthesizing the desired DNA sequence or ligating
synthesized fragments of the desired DNA sequence. One method is to synthesize fragments
based upon optimal crossover locations from all the parents and randomly anneal the
fragments to produce a recombinant library. A related method reduces the need to synthesize
each full length parental gene by encompassing the use of overlap extension, a DNA
polymerase, and partial synthesis of the genes of interest to create the full length gene of
interest.

[Heading illegible in source.] The method is illustrated in FIG. 7 for two crossovers and
two parental genes. Split pool synthesis can be used to minimize the synthesis burden. The
method of Volkov et al., Nucl. Acids Res., 27:18 (1999) may be used. As shown in FIG. 7,
a "grey" parent and a "black" parent are each cut at positions 1 and 2. Crossover
recombination at these cut points or crossover regions generates eight possible recombinants,
including two that are identical to one of the parents. The remaining six recombinants have
mutant sequences with contributions from each parent that cross over to a contribution from
the other parent at one or both cut points. See FIG. 7, part (A). Each of these recombinants
can be made by assembly of synthetic fragments that contain the cut points or crossover
locations, i.e., at least one of every pair of fragments to be joined contains residues from one
or the other parent that extend past the cut point, as shown in FIG. 7, part (B). In this
example, the terminal fragments have end primers that include a cut point, resulting in four
possible fragments on the left, four on the right, and two (one from each parent) in the
middle. These fragments can be reassembled in eight different sets of three, to produce each
of the eight recombinants in FIG. 7, part (A).
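
The counts quoted for FIG. 7 can be reproduced by simple enumeration. The Python sketch
below is illustrative only and is not taken from the original disclosure; the function names
are hypothetical, and the fragment model (each terminal fragment carrying an overhang that
matches the parent of the middle segment) is one assumed reading of FIG. 7, part (B).

```python
from itertools import product


def enumerate_recombinants(parents, n_cut_points):
    """All ways to pick a parent for each of the n_cut_points + 1 segments."""
    return list(product(parents, repeat=n_cut_points + 1))


def is_parental(recombinant):
    """True when every segment comes from the same parent."""
    return len(set(recombinant)) == 1


def fragments_for(recombinant):
    """Left, middle and right synthetic fragments for a two-cut-point design.

    Assumed reading of FIG. 7(B): each terminal fragment carries its own
    parent's sequence plus an overhang, past the cut point, that matches
    the parent of the middle segment.
    """
    left_parent, middle_parent, right_parent = recombinant
    left = (left_parent, middle_parent)
    middle = (middle_parent,)
    right = (middle_parent, right_parent)
    return left, middle, right


recombinants = enumerate_recombinants(("grey", "black"), 2)
crossovers = [r for r in recombinants if not is_parental(r)]
left_fragments = {fragments_for(r)[0] for r in recombinants}
middle_fragments = {fragments_for(r)[1] for r in recombinants}
right_fragments = {fragments_for(r)[2] for r in recombinants}
```

Under these assumptions the enumeration gives eight recombinants, six of which are true
crossovers, built from four left, two middle and four right fragments.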

In Vitro - In Vivo Recombination. A hybrid in vitro-in vivo recombination method
is outlined in FIG. 8. In FIG. 8, the method pertains to the shuffling of two parental genes.
The method encompasses gene assembly using synthetic fragments and overlap extension
with fragments followed by gap repair, which creates double stranded sequences containing
mismatched regions. The mismatches are then repaired randomly in vivo when inserted into
an appropriate host cell in the form of a heteroduplex plasmid. This method removes
parental homoduplexes and results in a library of random crossovers near the mismatched
sites for each of the two reactions. Further complexity (more crossovers) can be added easily
by adding fragments corresponding to desired crossover points.

In FIG. 8, a "grey" parent and a "black" parent represent polymers (e.g., genes), to be
cut and reassembled at two cut points. Synthetic fragments from each parent are extended
at a cut point to correspond with the sequence of the other parent, by using the other parent
as a template. For example, fragments from the black parent are extended at designated cut
points with sequences from the other parent, using that parent as a template. Fragments
derived from the grey parent are likewise extended using the black parent as a template.
[Remainder of paragraph illegible in source.]

In the example of FIG. 8, with two cut points, two sets of four different duplexes are
possible, for a total of eight duplexes. These represent the eight possible recombinations of
sequences from the two parents by crossovers at the two cut points. Two of these duplexes
are homoduplexes, meaning that the sequences of both polymers are identical to each other.
They are also each identical to one of the parent polymers. The remaining six duplexes are
heteroduplexes, meaning that the sequences of each polymer in the duplex pair are different.
One member of each heteroduplex has a sequence identical to one of the parents. The other
member of each heteroduplex pair is a crossover recombinant, with a sequence that crosses
over from one parent to the other at one or more of the cut points. In this example, with two
cut points, a crossover can occur at one or both cut points, resulting in two sets of three
recombinant sequences that differ from parent sequences. As shown in FIG. 8, these six
crossover recombinants are (black-grey-black), (grey-grey-black), (black-grey-grey), and the
"reverse" set of recombinants (grey-black-grey), (black-black-grey), and (grey-black-black).

The duplexes produced by this method can be introduced to an appropriate host cell
for heteroduplex recombination, which serves to remove the parent homoduplexes. The
result is a library of crossover recombinants having sequences contributed by both parents.
It will be understood that this discussion and FIG. 8 are an illustration of a general
technique that is applicable to the inventions. For example, more than two parents and/or
more than two cut points can be used.

PCR Amplification. Another method is outlined in FIG. 9. Gene fragments for
reassembly can be prepared by PCR with primers directed for crossovers. The primers can
be designed such that a single primer will hybridize equally to all parent strands at the
desired positions at crossover locations. The fragments prepared by these reactions are
pooled and reassembled by PCR with flanking primers, e.g. 1+6 in the example. The
resulting PCR products will have crossovers directed to locations of the primers.






As shown in FIG. 9, several sets of primers are made for each parent polymer. One
set of primers corresponds to the terminal ends of the polymer. In this example, there is one
primer for each of the 3' and 5' ends of a polynucleotide, designated 1 and 6 in FIG. 9. Each
remaining set of primers corresponds to each cut point, and in this example there are two
primers for each cut point. These are designated 2 and 3 for the cut point at the left, and 4
and 5 for the cut point at the right in FIG. 9. Similar sets of primers are prepared for each
other parent. PCR amplification is performed using pairs of primers that flank adjacent
regions of the polymer, e.g. primers 1 and 2, primers 3 and 4, and primers 5 and 6. All of the
possible fragments from all of the parents are reassembled in a pool, using PCR reactions
starting with primers 1 and 6.
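
The primer numbering of FIG. 9 generalizes to any number of cut points. The short Python
sketch below is illustrative only; the function names are hypothetical, and the fragment
count assumes that each parent yields exactly one fragment per primer pair.

```python
def segment_primer_pairs(n_cut_points):
    """Primer pairs that flank each segment, numbered as in FIG. 9.

    Primer 1 and the last primer sit at the termini; every cut point
    contributes two primers, one on each side.
    """
    last = 2 * n_cut_points + 2
    return [(first, first + 1) for first in range(1, last, 2)]


def flanking_primers(n_cut_points):
    """Terminal primers used for the pooled reassembly reaction."""
    return (1, 2 * n_cut_points + 2)


def pooled_fragment_count(n_parents, n_cut_points):
    """Fragments in the pool, assuming one fragment per parent per segment."""
    return n_parents * (n_cut_points + 1)
```

For two cut points this reproduces the pairs 1 and 2, 3 and 4, and 5 and 6, with reassembly
primed from 1 and 6.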

Family Shuffling. Another method is outlined in FIG. 10, which is a DNA shuffling
method as described e.g. by the 1994 Stemmer references. The recombination is directed to
specific sites utilizing "crossover" primers. The crossover primers are synthesized to contain
crossover sequences and are used during the reassembly reaction. The concentration of the
primer can be varied and can be much higher than that of the parental genes.

In this approach, sets of primer pairs are prepared. Each primer of each pair has
sequences from two parents which span and include a designated crossover location. FIG.
10, part (A). The parent genes are fragmented, and fragments are reassembled in the
presence of the primers using PCR amplification. The primers promote reassembly and
amplification at the crossover locations they span, to produce complementary recombinants
with sequences from more than one parent. FIG. 10, part (B). Two parents and two cut
points are shown in this example, but more may be used. In the figure, a partially
reassembled sequence for one recombinant is shown, with terminal sequences coming from
one parent (black) and the middle or intervening sequences coming from another parent
(grey).

The methods described above and illustrated by FIGS. 7-10 are novel methods for
targeting optimal crossover locations, in particular based on the techniques and calculations
described herein, e.g. in Sec. 5.4 above.






Screening Hybrids With Protected Schema

According to the invention, crossovers at locations that minimally disrupt coupling
interactions with other residues are more likely to lead to functional proteins. By focusing
the crossovers in a directed evolution experiment to residues having crossover locations that
minimize the disruption of coupling interactions or domains, the number of sequences that
must be tested or screened is considerably reduced.

Referring specifically to embodiments where the parent sequence is a protein or other
polypeptide sequence, the parent sequence (and mutants thereof) may be expressed in facile
gene expression systems to obtain libraries of mutant proteins. Any source of nucleic acid
in purified form can be utilized as the starting nucleic acid. Thus, the process may employ
DNA or RNA, including messenger RNA. The DNA or RNA may be either single or double
stranded. In addition, DNA-RNA hybrids which contain one strand of each may be utilized.
The nucleic acid sequence may also be of various lengths depending on the size of the
sequence to be mutated. Preferably, the specific nucleic acid sequence is from 50 to 50,000
base pairs. It is contemplated that entire vectors containing the nucleic acid encoding the
protein of interest may be used in these methods.

Once the evolved polynucleotide molecules are generated they can be cloned into a
suitable vector selected by the skilled artisan according to methods well known in the art.
If a mixed population of the specific nucleic acid sequence is cloned into a vector it can be
clonally amplified by inserting each vector into a host cell and allowing the host cell to
amplify the vector. The mixed population may be tested to identify the desired recombinant
nucleic acid fragment. The method of selection will depend on the DNA fragment desired.
For example, in this invention a DNA fragment which encodes for a protein with improved
properties can be determined by tests for functional activity and/or stability of the protein.
Such tests are well known in the art.

Using the methods of directed evolution, the invention provides a novel means for
producing functional and soluble proteins with improved activity toward one or more
substrates. The mutants can be expressed in conventional or facile expression systems such
as E. coli. Conventional tests can be used to determine whether a protein of interest
produced from an expression system has improved expression, folding and/or functional
properties. For example, to determine whether a polynucleotide subjected to directed
evolution and expressed in a foreign host cell produces a protein with improved activity, one
skilled in the art can perform experiments designed to test the functional activity of the
protein. Briefly, the evolved protein can be rapidly screened, and is readily isolated and
purified from the expression system or media if secreted. It can then be subjected to assays
designed to test functional activity of the particular protein.

A flow chart of an exemplary directed evolution algorithm is illustrated in FIG. 14.
A library of mutants can be made by any of the methods described herein. The library can
be sorted or restricted using the computational methods of the invention to identify the most
promising subset of "fit" mutants. These can be screened to pick the most fit mutant. This
process can be repeated in successive generations, until no further changes are observed, a
set goal is achieved, or the process is ended at any desired step.
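
The loop described for FIG. 14 can be sketched in Python as follows. This is a schematic
illustration under stated assumptions, not the algorithm of the figure itself: all function
and parameter names are hypothetical, the library generator, disruption predictor and
screening assay are supplied by the caller, and the toy stand-ins at the end exist only to
make the sketch runnable.

```python
def directed_evolution(parent, make_library, predicted_disruption, screen,
                       keep=10, max_generations=20, goal=None):
    """Iterate library -> computational filter -> screen -> select."""
    best, best_fitness = parent, screen(parent)
    for _ in range(max_generations):
        library = make_library(best)
        # Computational step: keep the mutants predicted to be least disrupted.
        shortlist = sorted(library, key=predicted_disruption)[:keep]
        if not shortlist:
            break
        # Experimental step: screen the shortlist once and pick the fittest.
        scored = [(screen(m), m) for m in shortlist]
        fitness, candidate = max(scored, key=lambda pair: pair[0])
        if fitness <= best_fitness:
            break  # no further change observed
        best, best_fitness = candidate, fitness
        if goal is not None and best_fitness >= goal:
            break  # set goal achieved
    return best, best_fitness


# Toy stand-ins (hypothetical): fitness is the number of 'A' residues.
def toy_library(seq):
    return [seq[:i] + "A" + seq[i + 1:] for i in range(len(seq))]


def toy_screen(seq):
    return seq.count("A")


def toy_disruption(seq):
    return 0  # placeholder: every mutant is treated as equally undisrupted
```

The three stopping conditions in the loop mirror those named in the text: no further change,
a set goal, or a fixed number of generations chosen by the user.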

5.8 Implementation Systems and Methods

Computer System.

The analytical methods described in the previous subsections may preferably be
implemented by the use of one or more computer systems, such as those described herein.
Accordingly, FIG. 11 schematically illustrates an exemplary computer system suitable for
implementation of the analytical methods of this invention. Computer 201 is illustrated here
as comprising internal components linked to external components. However, a skilled
artisan will readily appreciate that one or more of the components described herein as
"internal" may, in alternative embodiments, be external. Likewise, one or more of the
"external" components described here may also be internal. The internal components of this
computer system include processor element 202 interconnected with a main memory 203.
For example, in one preferred embodiment computer system 201 may be a Silicon Graphics
R10000 Processor running at 195 MHz or greater and with 2 gigabytes or more of physical
memory. In another, less preferable, exemplary embodiment computer system 201 may be
an Intel Pentium based processor of 150 MHz or greater clock rate and with 32 megabytes or
more of main memory.

The external components may include a mass storage 204. This mass storage may
be one or more hard disks which are typically packaged together with the processor and
memory. Such hard disks are typically of at least 1 gigabyte storage capacity, and more
preferably have at least 5 gigabytes or at least 10 gigabytes of storage capacity. The mass
storage may also comprise, for example, a removable medium such as a CD-ROM drive, a
DVD drive, a floppy disk drive (including a Zip™ drive), or a DAT drive. Other
external components include a user interface device 205, which can be, for example, a
monitor and a keyboard. In preferred embodiments the user interface is also coupled with
a pointing device 206 which may be, for example, a "mouse" or other graphical input device
(not illustrated). Typically, computer system 201 is also linked to a network link 207, which
can be part of an Ethernet or other link to one or more other, local computer systems (e.g.,
as part of a local area network or LAN), or the network link may be a link to a wide area
communication network (WAN) such as the Internet. This network link allows computer
system 201 to communicate with one or more other computer systems.

Typically, one or more software components are loaded into main memory 203
during operation of computer system 201. These software components may include both
components that are standard in the art and components that are special to the invention, and
the components collectively cause the computer system to function according to the
analytical methods of the invention. Typically, the software components are stored on mass
storage 204 (e.g., on a hard drive or on removable storage media such as on one or more
CD-ROMs, RW-CDs, DVDs, floppy disks or DATs). Software component 210 represents an
operating system, which is responsible for managing computer system 201 and its network
interconnections. This operating system is typically an operating system routinely used in
the art and may be, for example, a UNIX operating system or, less preferably, a member of
the Microsoft Windows™ family of operating systems (for example, Windows 2000,
Windows Me, Windows 98, Windows 95 or Windows NT) or a Macintosh operating system.

Software component 211 represents common languages and functions conveniently
present in the system to assist programs implementing the methods specific to the invention.
Languages that may be used include, for example, FORTRAN, C, C++ and, less preferably,
JAVA.

The analytical methods of the invention may also be programmed in mathematical
software packages which allow symbolic entry of equations and high-level specification of
processing, including algorithms to be used, thereby freeing a user of the need to
procedurally program individual equations or algorithms. Examples of such packages
include Matlab from Mathworks (Natick, Massachusetts), Mathematica from Wolfram
Research (Champaign, Illinois) and S-Plus from Math Soft (Seattle, Washington).
Accordingly, software component 212 represents the analytic methods of the invention as
programmed in a procedural language or symbolic package.

The memory 203 may, optionally, further comprise software components 213 which
cause the processor to calculate or determine a three-dimensional structure for a
macromolecule and, in particular, for a given polymer sequence such as a protein or nucleic
acid sequence. Such programs are well known in the art, and numerous software packages
are available. This software includes Swiss-PdbViewer (Glaxo Wellcome Experimental
Research); Biograf (Molecular Simulations, Inc.); O (generally used for crystallography);
Explorer (MSI); Quanta; CHARMM; and Sybyl (Tripos). The memory may also comprise
one or more other software components, such as one or more other files representing, e.g.,
one or more sequences of polymer residues including, for example, a parent sequence and/or
other sequences (for example, mutant sequences). The memory 203 may also comprise one
or more files representing the three-dimensional structures of one or more sequences,
including a file representing the three-dimensional structure of a parent sequence, such as a
parent protein or nucleic acid.






Computer Program Products.

The invention also provides computer program products which can be used, e.g., to
program or configure a computer system for implementation of analytical methods of the
invention. A computer program product of the invention comprises a computer readable
medium such as one or more compact disks (i.e., one or more "CDs", which may be
CD-ROMs or RW-CDs), one or more DVDs, one or more floppy disks (including, for
example, one or more ZIP™ disks) or one or more DATs, to name a few. The computer
readable medium has encoded thereon, in computer readable form, one or more of the
software components 212 (FIG. 11) that, when loaded into memory 203 of a computer
system 201, cause the computer system to implement analytic methods of the invention. The
computer readable medium may also have other software components encoded thereon in
computer readable form. Such other software components may include, for example,
functional languages 211 or an operating system 210. The other software components may
also include one or more files or databases including, for example, files or databases
representing one or more polymer sequences (e.g., protein or nucleic acid sequences) and/or
files or databases representing one or more three-dimensional structures for particular
polymer sequences (e.g., three-dimensional structures for proteins and nucleic acids).

System Implementation.

In an exemplary implementation, to practice the methods of the invention a parent
sequence may first be loaded into the computer system 201 (FIG. 11). For example, the
parent sequence may be directly entered by a user from monitor and keyboard 205 by
directly typing a sequence of code or symbols representing different residues (e.g., different
amino acid or nucleotide residues). Alternatively, a user may specify parent sequences, e.g.,
by selecting a sequence from a menu of candidate sequences presented on the monitor or by
entering an accession number for a sequence in a database (for example, the GenBank or
SWISS-PROT database), and the computer system may access the selected parent sequence
from the database, e.g., by accessing a database in memory 203 or by accessing the sequence
from a database over the network connection, e.g., over the internet.






The programs may then cause the computer system to obtain a three-dimensional
structure of the parent sequence. For example, the three-dimensional structure for the parent
sequence may also be accessed from a file (for example, a database of structures) in the
memory 203 or mass storage 204. Alternatively, the three-dimensional structure may also
be retrieved through the computer network (e.g., over the network) from a database of
structures such as the PDB database. In yet other embodiments, the software components
may, themselves, calculate a three-dimensional structure using the molecular modeling
software components. Such software components may calculate or determine a
three-dimensional structure, e.g., ab initio or may use empirical or experimental data such
as X-ray crystallography or NMR data that may also be entered by a user or loaded into the
memory 203 (e.g., from one or more files on the mass storage 204 or over the computer
network 207). The software components may further cause the computer system to calculate
a conformational energy for the parent sequence using the three-dimensional structure.

Finally, the software components of the computer system, when loaded into memory
203, preferably also cause the computer system to determine a coupling matrix or, in the
alternative, a parameter related to or correlating with coupling interactions according to the
methods described herein. For example, the software components may cause the computer
system to generate one or more mutant sequences of the parent and, using the conformation
determined or obtained for the parent sequence, determine coupling interactions as well as
disrupted coupling interactions.

Upon implementing these analytic methods, the computer system preferably then
outputs, e.g., the coupling constants of the parent sequence or the disruption profile of the
mutant pool. For instance, the coupling interactions may be output to the monitor, printed
on a printer (not shown) and/or written on mass storage 204. In preferred embodiments, the
software components may also cause the computer system to select and identify one or more
particular crossover locations in the parent sequence for mutation, e.g., in a directed
evolution experiment. For example, the computer system may identify residues of the parent
sequence having crossover locations that minimally disrupt coupling interactions. These
residues could be identified, for a user, as ones which, if mutated, are more likely to improve
properties of the polymer in a directed evolution experiment while retaining function.

Alternative systems and methods for implementing the analytic methods of this
invention are also intended to be comprehended within the accompanying claims. In
particular, the accompanying claims are intended to include the alternative program
structures for implementing the methods of this invention that will be readily apparent to
those skilled in the relevant art(s).

6. EXAMPLES

The present invention is also described by means of particular examples. However,
the use of such examples anywhere in the specification is illustrative only and in no way
limits the scope and meaning of the invention or of any exemplified term. Likewise, the
invention is not limited to any particular preferred embodiments described herein. Indeed,
many modifications and variations of the invention will be apparent to those skilled in the
art upon reading this specification and can be made without departing from its spirit and
scope. The invention is therefore to be limited only by the terms of the appended claims
along with the full scope of equivalents to which the claims are entitled.

6.1 Computational Determination of Structural Schema

Structural schema of a biopolymer, e.g. a gene or protein, can be identified, and
crossover disruption profiles of identified schema can be calculated. These calculations can
be used to predict optimal crossover locations and resulting recombinant offspring that are
more likely to be stable, and exhibit new or improved properties. Schema disruption profiles
can be based on energy or distance calculations, or both. A preferred method, for its relative
computational efficiency, is based on interatomic distances.

Crossover Disruption Based on Interatomic Distances

Computing the distances between atoms, rather than a detailed energy calculation,
can significantly accelerate the calculation of coupling interactions between residues. To
do so, the three-dimensional structure of the protein [partly illegible in source; the passage
refers to the Protein Databank (PDB)] is read to obtain the coordinates of each atom of this
structure. The distances between all atoms are calculated with the equation,

\[ d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \tag{1} \]

where d_ij is the distance between atoms i and j, and (x_i, y_i, z_i) are the three-dimensional
coordinates of atom i. Two residues are considered coupled if any of their atoms (both side
chain and main chain, excluding hydrogens) are within a cutoff distance d_c. The parameter
d_c is set such that the average number of coupling interactions per residue is between 4 and
12. The preferred value for d_c is 4.0 angstroms, corresponding to approximately 7-8
interactions per residue. A two-dimensional coupling matrix c is used to keep track of the
coupled residues. An element of this matrix c_ij is equal to one if residues i and j are within
distance d_c and is zero otherwise.
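
The distance-based coupling calculation can be sketched in Python (3.8 or later, for
math.dist) as follows. This is an illustrative sketch rather than the original implementation:
the function names are hypothetical, the parser reads only ATOM records using the fixed
column layout of the PDB file format and ignores insertion codes and alternate locations,
atoms exactly at the cutoff are counted as coupled, and a residue is not counted as coupled
to itself.

```python
import math


def parse_heavy_atoms(pdb_lines):
    """Return [(residue_key, (x, y, z)), ...] for non-hydrogen ATOM records."""
    atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        element = line[76:78].strip().upper()
        if not element:
            # Older files omit the element column; fall back to the atom name.
            element = line[12:16].strip().lstrip("0123456789")[:1].upper()
        if element in ("H", "D"):
            continue  # hydrogens (and deuterium) are excluded
        residue_key = (line[21], int(line[22:26]))  # (chain ID, residue number)
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((residue_key, xyz))
    return atoms


def coupling_matrix(atoms, cutoff=4.0):
    """Build the 0/1 residue coupling matrix from heavy-atom coordinates."""
    residues = sorted({key for key, _ in atoms})
    index = {key: n for n, key in enumerate(residues)}
    c = [[0] * len(residues) for _ in residues]
    for a, (key_a, xyz_a) in enumerate(atoms):
        for key_b, xyz_b in atoms[a + 1:]:
            if key_a == key_b:
                continue  # same residue: not a coupling interaction
            if math.dist(xyz_a, xyz_b) <= cutoff:
                i, j = index[key_a], index[key_b]
                c[i][j] = c[j][i] = 1
    return residues, c


def mean_couplings_per_residue(c):
    """Average number of coupled partners per residue."""
    return sum(map(sum, c)) / len(c) if c else 0.0
```

The last helper gives the quantity used in the text to tune the cutoff: the cutoff can be
adjusted until the average falls in the stated range. The all-pairs loop is quadratic in the
number of atoms, which is acceptable for a sketch but would normally be replaced by a
spatial grid or tree for large structures.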

Despite the beneficial reduction in computational complexity, the disruption results
based on distance rather than energy calculations are not significantly altered. FIG. 15
compares the calculation of the single-cut-point crossover disruption for transformylase
based on the distance (top) and energy (bottom) definitions of coupling. The qualitative
shape of both plots is similar and a quantitative comparison of both measures yields
0.91 [symbol of the statistic illegible in source]. FIG. 16(A) shows a plot based on energy,
and FIG. 16(B) shows a plot based on distance. Due to the significant improvement in
calculation time, the distance-based definition of coupling is a preferred mode for the
disruption calculations.

The crossover disruption of a fragment can then be calculated using the equation

(2)

where the summation is over the residues in fragment α, N is the total number of residues,
N_α is the number of residues in fragment α, c_ij is the coupling matrix and P_i is the
probability that two parents have different amino acid identities at residue i. The
probabilities P_i and P_j are determined by examining a sequence alignment of the parents and
counting the number of times that the parents share an amino acid identity at that residue
according to Equation (3), where δ = 1 if parents j and k have the same amino acid at residue
i, and p is the total number of parents. When the sequences of the parents are unknown, or if
it is desirable to count disruption at positions where the amino acid identities are
identical, P_i can be set to unity for all i. Further, Equation (3) could be modified to
reflect physico-chemical similarities (such as charge or hydrophobicity) between amino acids,
thus weighting crossover disruption more heavily when comparing dissimilar amino acids.
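
The printed forms of Equations (2) and (3) are not recoverable from this text, so the following Python sketch shows only one reading that is consistent with the verbal definitions: the disruption of a fragment is taken as the sum of c_ij * P_i * P_j over residues i inside the fragment and residues j outside it, and P_i is taken as one minus the fraction of parent pairs that share an amino acid at position i. Both forms, the function names and the toy alignment are assumptions made for illustration.

```python
from itertools import combinations

def mismatch_probability(column):
    """Assumed form of P_i: fraction of parent pairs that differ at one position.

    column: the amino acids of all parents at one aligned position
    (at least two parents are required).
    """
    pairs = list(combinations(column, 2))
    shared = sum(1 for a, b in pairs if a == b)
    return 1.0 - shared / len(pairs)

def crossover_disruption(fragment, c, P):
    """Assumed form of the fragment disruption.

    Sums c[i][j] * P[i] * P[j] over residues i in the fragment and
    residues j outside the fragment.
    """
    inside = set(fragment)
    return sum(
        c[i][j] * P[i] * P[j]
        for i in inside
        for j in range(len(c))
        if j not in inside
    )

# Three hypothetical aligned parents, four residues long.
alignment = ["MKTA", "MRTA", "MKSA"]
P = [mismatch_probability(col) for col in zip(*alignment)]

# Hypothetical coupling matrix for the four residues.
c = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
E = crossover_disruption([0, 1], c, P)
print(P, E)
```

Positions where all parents agree receive P_i = 0 under this reading, so couplings that involve them contribute nothing. Setting every P_i to 1 reproduces the option of counting disruption at identical positions as well.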

Disrupting Folding Domains 

A data set generated by the experimental engineered shuffling of a thermostable
phytase with a mesostable phytase yields further insight into the disruption caused by domain
substitution. Jermutus, et al., Structure-based chimeric enzymes as an alternative to directed
enzyme evolution: phytase as a test case, J. Biotech., 85: 15-24 (2001). In this experiment,
two chimeric proteins were created by extracting small domains (1,2) from the thermophilic
(A. niger) phytase and inserting them into the less-stable (A. terreus) phytase. FIG. 17(A).
One chimera (HyA) was created by inserting a surface helix (residues 66-82), FIG. 17(1).
The second chimera (HyB) was created by inserting a buried beta-strand (residues 48-58)
(FIG. 17(1)). HyA2 was stabilized when compared to the A. terreus wild-type and HyB1 was
significantly destabilized.

This is shown by the comparison of melting temperatures (T_m) in degrees C for HyA
(mutant 2) and HyB (mutant 1) reported in the Experimental Data of FIG. 17. The melting
temperature of HyA (2) is higher than for HyB (1), meaning it is relatively more stable (more
energy, i.e., a higher temperature, is needed to cause unfolding). Similarly, the temperatures
at which 50% of the HyA and HyB proteins are unfolded (T_50) show that HyA is more stable
and HyB is less stable. FIG. 17 also shows comparison data for thermodynamic properties
of the wild type A. terreus phytase enzyme (wt) and the wild type thermophilic A. niger
phytase (wt-insert).

To determine the disruptiveness of the two domains, the crossover disruption was
calculated for each domain insertion and statistically compared to the disruptiveness of all
fragments in phytase. The crossover disruption of the HyA mutant is 8.12 and that of HyB is
10.77 (FIG. 17, Calculations). While HyA is less disruptive, both compare well to the
average crossover disruption of 19.26 (standard deviation 4.09), calculated by determining the
disruptiveness of all possible fragments. To emphasize this trend, the Z-score was calculated
for each chimera, where the Z-score Z_i of fragment i is defined as:

$$Z_i = \frac{E_i - \langle E_\alpha \rangle}{\sigma(E_\alpha)} \qquad (4)$$

where E_i is the crossover disruption of fragment i, <E_α> is the average crossover disruption
of all fragments, and σ(E_α) is the standard deviation of the crossover disruption of all
fragments. The Z-score of HyA is -2.72 and that of HyB is -2.08 (FIG. 17), indicating that
while HyA is predicted to be a more acceptable substitution than HyB, both have a very low
disruption when compared to the average.
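
As a quick arithmetic check, the reported Z-scores follow directly from the reported disruptions, mean and standard deviation. The helper name `z_score` is an illustrative choice; the numbers are those quoted in the text.

```python
def z_score(value, mean, std):
    """Standard score: how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std

MEAN_DISRUPTION = 19.26  # average over all possible phytase fragments (from the text)
STD_DISRUPTION = 4.09    # standard deviation over all possible fragments (from the text)

z_hya = z_score(8.12, MEAN_DISRUPTION, STD_DISRUPTION)
z_hyb = z_score(10.77, MEAN_DISRUPTION, STD_DISRUPTION)
print(f"HyA: {z_hya:.2f}, HyB: {z_hyb:.2f}")  # prints "HyA: -2.72, HyB: -2.08"
```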



Both chimeras have relatively low crossover disruption values because they are both
small fragments. Normalizing the crossover disruption measure by the number of residues
in the fragment, N_α, and the total number of residues, N, accounts for this effect, as given by:

(5)

which defines the normalized disruption measure. Other possibilities include normalizing the
crossover disruption by the number of residues in the domain alone, and normalizing by the
square root of the number of residues in the domain.

When the normalized measure is used to calculate the disruptiveness of the substitutions into
phytase, HyA has a disruption of 0.005 and HyB has a disruption of 0.014 (FIG. 17). The average
disruption for all possible fragments is 0.006 (standard deviation is 0.002). The Z-score of
the HyA substitution is approximately -0.8 and that of the HyB substitution is large and
positive, indicating that by these measures, the HyB substitution is far more likely to be
destabilizing, as was found experimentally (Jermutus et al., 2001). In general, Equation (6)
is the preferred mode of the calculation due to the lack of dependence on the total number of
residues.

The normalized value for crossover disruption (Equation 6) can be used to determine
the compatibility of isolated fragments when substituted into the remaining structure. As an
example, the crossover disruption was calculated for fragments that appeared in the DNA
shuffling experiment with beta-lactamase (Crameri, 1998). Each fragment independently
exhibits a low crossover disruption when compared to all possible fragments. Before an
experiment, this type of calculation could be used to computationally suggest
fragments that are more likely to produce folded chimeras, based on their disruptiveness of
the structure. This approach could be applied to methods of "exon shuffling" whereby
parent genes are fragmented and recombined at crossover points based on their natural
intron-exon structure on the gene level. Kolkman & Stemmer, Nature Biotechnology, 19:
423-428 (2001). The computational method is able to determine the sets of exons that are
least likely to be disruptive when substituted into the structure.

Recombination can cause disruption on two levels of the hierarchical protein folding
process. First, the internal energy can be disturbed by the substitution of a parental fragment
that disturbs the interactions that stabilize the structure (the crossover disruption). Second,
if a fragment causes a highly concentrated region of crossover disruption, then this region is
unlikely to fold. Even if the remainder of the structure has a low internal energy (few broken
coupling interactions), the locally unfolded region would be severely destabilizing.
Combined, the phytase and beta-lactamase data sets support this view of disruption. In both
experiments, crossovers that distributed the disruption throughout the gene, rather than in
localized regions of high crossover disruption, generated stable chimeras. Practically, this
implies that it is better to have a large absolute crossover disruption (large total) that is
well distributed across the gene (low for all the fragments), than to have a small absolute
crossover disruption (low total) that is very localized (large for one fragment).

Calculating Compact Units of Structure 

The current view of protein folding is that the process is hierarchical. First, a very
fast "burst" phase occurs where the unfolded polypeptide rapidly collapses into highly
compact units, such as alpha-helices. Next, the substructures condense into the tertiary
arrangement of the native structure (FIG. 18). The experimental observation that folding is
hierarchical has led to the "building block" theory that proteins have subunits that fold and
then assist higher-level rearrangements. Tsai, C-J., et al., Anatomy of protein structure:
visualizing how a one-dimensional protein folds into a three-dimensional shape, Proc. Natl.
Acad. Sci. USA, 97: 12038-12043 (2000). According to the invention, crossovers that do
not disrupt these building blocks will be more likely to lead to functional chimeras.

A useful tool to visualize local units of condensed structure ("building blocks") is the
contact map. Rossmann, M. G. & Liljas, A., J. Molec. Biol., 85: 177-181. The contact map
is constructed by measuring the distance between all alpha-carbons in the three-dimensional
structure (Equation 1) and then generating a two-dimensional matrix where residues that are
within a cutoff distance are marked as white whereas residues that lie outside this cutoff
distance are marked as black. Domains that occur on the level of the one-dimensional
polypeptide chain can be identified as triangles that can be drawn on the diagonal that do not
contain any black regions (FIG. 19). Effectively, this identifies fragments of the structure
that fold into a sphere whose diameter is the cutoff distance.
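
A contact map and the "triangle without black" test can be sketched as follows. The names `contact_map` and `triangle_is_white`, the 5.0 angstrom cutoff and the coordinates are hypothetical and chosen only to keep the example small; white is represented by True and black by False.

```python
import math

def contact_map(ca_coords, cutoff):
    """True (white) where two alpha-carbons lie within the cutoff, False (black) otherwise."""
    n = len(ca_coords)
    return [
        [math.dist(ca_coords[i], ca_coords[j]) <= cutoff for j in range(n)]
        for i in range(n)
    ]

def triangle_is_white(cmap, start, end):
    """Check that every residue pair in the fragment start..end (inclusive) is in contact."""
    return all(
        cmap[i][j]
        for i in range(start, end + 1)
        for j in range(i + 1, end + 1)
    )

# Made-up alpha-carbon trace: four residues packed together, a fifth far away.
ca = [(0, 0, 0), (3, 0, 0), (3, 3, 0), (0, 3, 0), (30, 0, 0)]
cmap = contact_map(ca, cutoff=5.0)

print(triangle_is_white(cmap, 0, 3))  # the four packed residues
print(triangle_is_white(cmap, 0, 4))  # includes the distant residue
```

The first fragment forms an all-white triangle on the diagonal and the second does not, because every pair that involves the distant residue is black.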

Several algorithms have been proposed to divide the contact map into regions, thus
identifying domains in the structure. See e.g., De Souza, et al., Intron positions correlate
with module boundaries in ancient proteins, Proc. Natl. Acad. Sci. USA, 93: 14632-14636
(1996); Gilbert, et al., Origin of Genes, Proc. Natl. Acad. Sci. USA, 94: 7698-7703 (1997);
Go, M., Correlation of DNA exonic regions with protein structural units in haemoglobin,
Nature, 291: 90-92 (1981); Go, M., Modular structural units, exons, and function in chicken
lysozyme, Proc. Natl. Acad. Sci. USA, 80: 1964-1968 (1983).

Go originally proposed that lines should be drawn that cross through the largest white
regions with the intent to separate the black regions. This fragments the structure into
domains in a way that minimizes the interaction between the domains. While this algorithm
was crudely successful in demonstrating the correlation between exons and subdomains, it
often fails on complicated structures that do not have an obvious domain structure.
Measuring the number of interactions at each site can quantitate this algorithm, using a
per-residue measure R_i in which h_ij = 0 if residues i and j are closer than d_cut and
h_ij = 1 if residues i and j are farther than d_cut. According to Go, residues that minimize
R_i are more likely to be regions between domains.

In this example a plot of R_i for transformylase was generated. FIG. 20(B). This
algorithm predicts that there are three domain-forming regions in the protein structure (the
valleys), whereas two were sampled in the in vitro recombination experiment (FIG. 20A).
This indicates that, while crossovers in this region could form a domain, too many coupling
interactions are disrupted between the fragments, thus leading to destabilized structures.
Further, a calculated contact map (FIG. 21) and a plot of R_i (FIG. 22) for beta-lactamase
show that, while some crossovers occurred in regions that are predicted to separate domains,
this algorithm was relatively weak for predicting crossover locations. Other domain-
separating algorithms based on analyzing the contact map have been proposed, but are not
reliably consistent when analyzing the locations of crossovers in recombination experiments
(De Souza et al., 1996; Gilbert et al., 1997).

SCHEMA: Schema-based Hybrid Protein Optimization

The present method identifies domains ("building blocks") in proteins based on
analyzing the contact map to optimize recombinants based on schema. FIG. 23. This
algorithm is based on searching the protein structure for regions that are compact, based on
comparing the length of a fragment with the size of the sphere into which the fragment folds.
Gilbert and co-workers found that, for a domain diameter of 21 angstroms, the average
fragment that can fold into this sphere is 15 residues long with a standard deviation of 5
residues (De Souza et al., 1996). In other words, if a fragment of 20 residues folds such that
all the residues are within a sphere of 21 angstroms, then this fragment can be considered as
being highly compact. Further, if a fragment of 15 residues folds into a sphere of 21
angstroms, then the compactness of this unit is statistically average. This observation is
utilized here by choosing n_min and d_max such that if a fragment of size n_min or
greater is folded into a sphere of diameter d_max, then this fragment is considered to be
compact. The theory predicts that such compact units should not be disrupted by crossovers.

To determine the regions that are compact, the entire protein structure is scanned with
fragments of size n_min and greater (FIG. 23). Each fragment is checked for whether it can
fold into a sphere of diameter d_max by inspecting the contact map for any regions of black
(residues that are separated by more than d_max angstroms) in the triangle that defines the
fragment. If there is no black in the triangle, then a compact unit is defined and crossovers
are disfavored along the fragment because this would disrupt a structural building block. To
demark this, a schema disruption profile is defined where higher values indicate a more
disruptive event. The profile is defined by Equation (9), where CM_mn is the element of the
contact matrix corresponding to residues m and n, and δ(f) is a function that is equal to 1
if f = 0 and equal to 0 if f > 0. Effectively, Equation (9) counts the number of times that
residue i is involved in a compact unit. A residue that has a large S_i value is involved in a
more compact unit than a residue that has a low S_i value.
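
The scan described above can be sketched as a small Python function. The exact printed form of Equation (9) is not recoverable here, so the code follows only the verbal description: every fragment of at least n_min residues whose triangle in the contact map contains no black adds one to the count of each residue it contains. The function name, the boolean encoding (True for white) and the one-dimensional toy "structure" are assumptions.

```python
def schema_profile(contact, n_min):
    """Count, for each residue, the compact fragments (all-white triangles) containing it."""
    n = len(contact)
    profile = [0] * n
    for start in range(n):
        for end in range(start + n_min - 1, n):
            compact = all(
                contact[m][k]
                for m in range(start, end + 1)
                for k in range(m + 1, end + 1)
            )
            if not compact:
                # Any longer fragment from this start contains the same black pair.
                break
            for i in range(start, end + 1):
                profile[i] += 1
    return profile

# Toy "structure": residues placed on a line; two residues are in contact
# when they are at most 2.5 units apart (made-up numbers).
positions = [0, 1, 2, 10, 11]
contact = [[abs(a - b) <= 2.5 for b in positions] for a in positions]

profile = schema_profile(contact, n_min=2)
print(profile)
```

The middle residue of the first cluster belongs to the most compact fragments and therefore receives the highest count, which matches the statement that a larger S_i marks a residue involved in a more compact unit.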

Making crossovers in building blocks that interact with the structure is more disruptive
than making crossovers in building blocks that are isolated from the remainder of the
structure. Following this idea, the algorithm combines the crossover disruption measure
(based on the disturbance of coupling interactions) with the domain-based disruption
measure to identify the compact units that are nucleating in folding (FIG. 23). To do this,
we add a term to Equation 9 such that fragments that fold into a compact unit, but are not
interacting with the remainder of the structure, are not counted in the schema disruption
profile. The modified equation is Equation (10), where the function g(x,y) is equal to 1 if
x > y and 0 otherwise. The schema disruption profile generated by Equation (10) identifies
the regions of the protein that are involved in a compact unit that significantly contributes
to the stability of the protein (many coupling interactions). If a crossover occurs in these
regions, then it is more likely to have a destabilizing effect on the structure.
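
One way to picture the added term is as a gate on each compact fragment: the fragment contributes to the profile only if its number of coupling interactions with the rest of the structure exceeds a threshold. The sketch below assumes that reading. The list of compact fragments is taken as given, and the names `external_couplings`, `filtered_profile` and `min_couplings` are hypothetical; only the step function g(x, y) comes from the text.

```python
def g(x, y):
    """Step function from the text: 1 if x > y, otherwise 0."""
    return 1 if x > y else 0

def external_couplings(start, end, coupling):
    """Count couplings between residues start..end (inclusive) and all other residues."""
    n = len(coupling)
    return sum(
        coupling[i][j]
        for i in range(start, end + 1)
        for j in range(n)
        if j < start or j > end
    )

def filtered_profile(compact_fragments, coupling, min_couplings):
    """Profile that counts only compact fragments coupled to the rest of the structure."""
    profile = [0] * len(coupling)
    for start, end in compact_fragments:
        weight = g(external_couplings(start, end, coupling), min_couplings)
        for i in range(start, end + 1):
            profile[i] += weight
    return profile

# Made-up 5-residue coupling matrix: residues 0 and 1 both couple to residue 2;
# residues 3 and 4 couple to nothing.
coupling = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
profile = filtered_profile([(0, 1), (3, 4)], coupling, min_couplings=1)
print(profile)
```

Here the isolated fragment (3, 4) is compact but makes no couplings with the remainder of the structure, so it drops out of the profile, while fragment (0, 1) is kept.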

In Vitro Recombination Results: Beta-lactamase, Transformylase, P450

The results of the SCHEMA calculation on the transformylase and beta-lactamase
data sets using the schema-based algorithm are shown in FIGS. 24 and 25, respectively. The
algorithm rapidly locates the regions in which crossovers are disruptive. The advantages of
the schema calculation over the alignment-based algorithm are threefold. First, the
calculation is deterministic and does not rely on sampling or the method of computational
hybridization that is used to reconstruct chimeric genes in silico. Second, the SCHEMA
calculation only requires the structure file and does not rely on the accuracy of an alignment
algorithm. Finally, the minima in the schema disruption profile are the optimal cut points,
whereas the maxima in the stochastic algorithm are the statistically most likely cut points.

The algorithm predictions were compared with an in vitro evolution experiment that
recombined low-sequence identity (25%) P450scc and P450c27 genes. Pikuleva, et al.,
Studies of distant members of the P450 superfamily (450scc and 450c27) by random
chimeragenesis, Archives of Biochem. and Biophys., 334: 183-192 (1996). In this
experiment, several chimeras were generated that folded into the native structure. While the
structures of scc and c27 are unknown, the structure of a mammalian P450 (2C5) was
recently solved. The schema disruption profile for the 2C5 structure was calculated (FIG.
26A) and was compared to the crossovers that resulted in folded chimeric sequences. The
equivalent locations for the crossovers were determined by running a BLAST alignment of
the local region around the crossovers (Pikuleva, Bjorkhem & Waterman, Archives of
Biochem. and Biophys., 334: 183-192 (1996)). These crossovers are in regions that are
predicted to be the least disruptive. In another experiment, a bacterial P450cam and human
P450 2C9 were recombined at a single cut point. Shimoji, M., et al., Design of a novel P450:
a functional bacterial-human cytochrome P450 chimera, Biochemistry, 37: 8848-8852 (1998).
The chimera resulted from a rationally designed cut point. The crossover occurred at a
location that corresponds to a minimum in the profile calculated for the 2C5 structure.
FIG. 26B. Other results also support these calculations. See also, Hennecke et al., Random
circular permutation of DsbA reveals segments that are essential for protein folding and
stability, J. Mol. Biol., 286: 1197-1215 (1999); Panchenko, et al., Foldons, protein
structural modules, and exons, Proc. Natl. Acad. Sci. USA, 93: 2008-2013 (1996).

The optimal parents for experimental methods that restrict the fragmentation (such
as DNA shuffling, restriction enzyme approaches, exon shuffling) can be determined by
analyzing the schema disruption profile. The parents, exons, or restriction enzymes can be
chosen such that the cut points occur at locations in the gene that minimize the schema
disruption.

FIG. 27A shows the total number of possible crossover locations for each parent
based on a minimum of six nucleotide overlap between parents. The differences in the total
number of crossovers correlate with the sequence identity shared between parents. For
example, parent 1 shares the most sequence identity with parents 2, 3, and 4 and parent 4
shares the least sequence identity with parents 1, 2, and 3. FIG. 27B shows the number of
crossover points that are consistent with generating a low schema disruption (< 30, values
from FIG. 25). Even though the total number of crossover points is greater for parent 3,
parent 4 has more potential crossover locations that are consistent with a low
schema disruption. This provides an explanation and possible mechanism for the
experimentally-observed absence of parent 3 in the improved chimeras previously reported
by Crameri et al., 1998. Thus, calculations and comparisons of this kind can be used to
predict optimal sets of parents for crossover recombination. In this calculation example,
parent 3 (Yersinia enterocolitica) would not be used, because it contributes a relatively high
crossover disruption in the schema disruption profile, in favor of the other parents, which
exhibit less crossover disruption.

6.2 Crossover Recombination of Lactamase-Like Genes by DNA Shuffling

This example describes experiments wherein the methods of the invention were used
to evaluate a crossover probability distribution for a family shuffling experiment wherein
four different β-lactamase-like genes (also referred to as cephalosporinase genes) were
recombined. (See, Crameri et al., Nature, 391:288 (1998)).

The three-dimensional structure for the backbone and side chains of the
cephalosporinase protein expressed by Enterobacter cloacae was retrieved from that protein's
high resolution crystal structure. Lobkovsky et al., Proc. Natl. Acad. Sci. U.S.A.,
90:11257-11261 (1993). Additional sequence information for the protein was retrieved from
the TrEMBL database (Bairoch & Apweiler, Nucl. Acids Res., 28:45-48 (2000)) (Accession
No. P05364). Sequences for homologous proteins expressed by other organisms were also
retrieved from the SWISS-PROT database (Bairoch & Apweiler, supra), including sequences
for cephalosporinase proteins expressed by Citrobacter freundii (Accession No. P05193),
Klebsiella pneumoniae (Accession No. P048437) and Yersinia enterocolitica (Accession No.
P45460).

Alignment of parental sequences. 

FIG. 3 is a gene alignment, using GAP, for four β-lactamase-like genes: (1)
Enterobacter cloacae, (2) Citrobacter freundii, (3) Yersinia enterocolitica and (4) Klebsiella
pneumoniae. SWISS-PROT or TrEMBL accession numbers for the protein sequences and
GenBank accession numbers for the DNA sequences are given. DNA sequences were
retrieved from the GenBank database (Accession Nos. X03866, X07274, X63149 and
X77455, respectively). These nucleotide sequences were also aligned, using the polypeptide
sequence alignment shown in FIG. 3 to align codons of the DNA sequences that encoded
aligned amino acid residues.

Generation of crossover mutants.

A library of possible recombinant mutants was generated in silico from the protein
alignments using all possible "crossover locations" or "cut points" determined for the nucleic
acid and protein alignments. Specifically, regions of four sequential amino acids in a first
aligned sequence that were identical at the same positions in another aligned sequence
were identified as candidate crossover regions for the affected parents.
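
The rule of four identical sequential amino acids can be illustrated with a short window scan over two aligned sequences. This is a sketch under assumptions: the function name `crossover_windows`, the use of "-" as the gap character and the two toy sequences are not from the original.

```python
def crossover_windows(seq_a, seq_b, window=4):
    """Return the start indices of windows where two aligned sequences are identical.

    Both sequences must come from the same alignment (equal length).
    Windows that contain a gap character "-" are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    starts = []
    for i in range(len(seq_a) - window + 1):
        a = seq_a[i:i + window]
        b = seq_b[i:i + window]
        if a == b and "-" not in a:
            starts.append(i)
    return starts

# Two hypothetical aligned parents that differ at positions 1 and 7.
parent_1 = "MKTAYIAKQR"
parent_2 = "MRTAYIAEQR"
print(crossover_windows(parent_1, parent_2))  # prints [2, 3]
```

For a family of four parents the same scan would be run on every pair, so that each candidate region is recorded together with the parents it connects.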

In this example, the parameter of four amino acids relates to a minimum required
DNA identity shared between parents for DNA hybridization to occur. On the DNA level,
six nucleotides of shared identity are required for hybridization to occur. The practical
reason that the DNA limit (6) is lower than the amino acid limit (4x3=12 nucleotides) is
because multiple codons can encode a single amino acid. This requires that a higher
threshold be used when calculating the possible crossover points based on an amino acid
alignment. Another approach would be to calculate the thermodynamic energy of
hybridization based on the specific base pairs on each parent. See Moore et al., Predicting
crossover generation in DNA shuffling, PNAS 98:6, 3226-3231 (2001). Also, the melting
temperature for the denaturation of the DNA overlap can be calculated based on the G-C and
A-T content. In this example, alignments are used to determine where sequences can
reanneal. Alignments are not necessary for the calculations of Examples 6.1 and 6.3.
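
The text does not say which melting-temperature formula is meant. Purely as an illustration, the sketch below uses the Wallace rule, a common rule of thumb for short oligonucleotides that assigns about 2 degrees C to each A or T and 4 degrees C to each G or C. It is a swapped-in technique, not the method of this example, and the function name `wallace_tm` is hypothetical.

```python
def wallace_tm(overlap):
    """Estimate the melting temperature (degrees C) of a short DNA overlap.

    Uses the Wallace rule: Tm = 2 * (A + T) + 4 * (G + C).
    This is an illustrative rule of thumb for short oligonucleotides only.
    """
    seq = overlap.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence may contain only A, C, G and T")
    return 2 * at + 4 * gc

# A G-C rich six-nucleotide overlap is predicted to be more stable than an A-T rich one.
print(wallace_tm("GCGCGC"), wallace_tm("ATATAT"))  # prints "24 12"
```

Ranking candidate overlaps by such an estimate is one simple way to distinguish overlaps of the same length that differ in base composition.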

Two exemplary in silico methods were used to generate candidate hybrids or
crossover mutants, based on the set of possible cut points determined from the alignment
algorithm. In both methods, parental fragments are cut at the crossover locations which
satisfy a predetermined crossover probability and are randomly recombined at those
crossover points to produce a pool of recombined or hybrid proteins.



Method 1 (Random Probability Model)

To generate a candidate crossover mutant, a parent sequence was selected at random
from the four cephalosporinase sequences. This sequence was written to the candidate
mutant sequence up to a possible cut point. Upon reaching the possible cut point, a random
number between 0 and 1 was chosen, and if the number was below a predetermined
crossover probability P_c, then a second parent was randomly chosen. (Note that because each
parent template is randomly selected for extension at a crossover point, the second parent
could in some cases be the same as the first parent.) The mutant sequence was then extended
from the cut point using the sequence of the second parent as a template, up until a next cut
point was reached. Then, a random number between 0 and 1 was again chosen. If this
number was below a predetermined crossover probability P_c, then the mutant sequence was
extended from the cut point using another randomly selected parent as a template, up to the
next cut point. In each case that the random number was not below a predetermined
crossover probability P_c, the mutant sequence was extended to the next cut point by
continuing with the same parent, i.e., without crossing over to another parent. The
probability P_c can be the same or different for each cut point. These steps were repeated until
the sequence was complete, e.g., a full-length hybrid protein was generated, comprising
fragments of different parents recombined at selected cut points.

This process was repeated many times, each time with a randomly selected parent,
until between about 10"* to 10^ full length cephalosporinase crossover mutants were
generated.
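
The steps of Method 1 can be sketched as a short simulation. The sketch assumes that all parents come from one alignment and therefore have equal length, that cut points are given as indices into that alignment, and that a single crossover probability is used at every cut point. The function name `method1_chimera`, the single-letter toy parents and the cut points are hypothetical, and the sketch does not enforce any minimum fragment length.

```python
import random

def method1_chimera(parents, cut_points, p_c, rng):
    """Build one chimera by walking along the alignment.

    At every cut point a random number is drawn; if it is below p_c a new
    template parent is picked at random (possibly the same parent again).
    """
    length = len(parents[0])
    current = rng.randrange(len(parents))  # random starting parent
    pieces, start = [], 0
    for cut in sorted(cut_points) + [length]:
        pieces.append(parents[current][start:cut])
        start = cut
        if cut < length and rng.random() < p_c:
            current = rng.randrange(len(parents))
    return "".join(pieces)

# Toy parents: each is one repeated letter so the template of every segment is visible.
parents = ["A" * 12, "B" * 12, "C" * 12, "D" * 12]
cut_points = [3, 6, 9]

rng = random.Random(0)  # fixed seed so a run can be repeated
library = [method1_chimera(parents, cut_points, 0.30, rng) for _ in range(1000)]
print(library[:5])
```

Because a fresh template is drawn independently at each accepted crossover, the effective rate of visible parent switches is somewhat lower than p_c; with four parents, one draw in four re-selects the current template.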

The crossover probability using Method 1 was based roughly on fragment size and
in this example was selected to be P_c = 0.30. In addition, a further restriction was imposed
where each polypeptide fragment must be at least eight amino acid residues in length before
another crossover was allowed to occur. The minimum fragment size of eight amino acids
reflects a lower experimental bound relevant to the Stemmer protocol when the beta-
lactamase genes were shuffled. In the DNA shuffling protocol, very small DNA fragments
get "lost" in the reaction mixture and cannot become part of a recombinant mutant. Thus,
this parameter is only relevant for Stemmer-like shuffling experiments and is not important
for other methods (e.g., StEP has no minimum fragment size). This rule is not connected
with disruption theory. Using these parameters, the average number of fragments per
recombinant mutant was 13.4, corresponding to an average of 80-100 nucleotides per
fragment. This was set to model results that were previously reported in actual directed
evolution experiments. See, FIG. 1B, FIG. 12 and Crameri et al., Nature, 391:288 (1998).

Method 2 (Parent Fragments)

An alternative method was also used to generate candidate crossover
mutants by DNA shuffling. This method is represented diagrammatically by FIG. 13. As
shown by the arrows in FIG. 13A, parental strands are fragmented by randomly distributing
cut points with probability P_c. In the figure, the arrows mark cut points and the thatched
lines represent regions of sequence similarity between parents. In FIG. 13B, a parent is
chosen at random to determine the first parental fragment. The next fragment is chosen
amongst the parents that share adequate sequence identity (including the parent of the
previous fragment) with equal probability. If the cut point at the end of the parent fragment
corresponds to an identified crossover location based upon sequence identity, as described
above, the next fragment is chosen from the pool of eligible parents, including the parent of
the previous fragment. This process is repeated until an entire offspring is created. The
complete library of recombinant mutants that can be generated by the cut pattern is shown
in FIG. 13C. When this method was utilized to generate crossover mutants, the crossover
probability in this example, based on fragment size, was set at P_c = 0.15. As in Method 1, a
further restriction was imposed so that each fragment must be at least eight amino acids in
length. See also, FIG. 1A.

In this experiment, the number of fragments per mutant was 7, and the average
fragment size was 80-100 nucleotides per fragment. As in Method 1, this is approximately
the same number that has been previously reported. See, Crameri et al., Nature, 391:288
(1998).



The distribution of crossover locations in the resulting library of mutants
is shown in FIGS. 4A and 4B using Method 1, and in FIGS. 4C and 4D using Method 2. The
graphs provided in these figures indicate the probability, P_c, that a mutant randomly selected
from the library has a crossover point at a given amino acid residue (horizontal axis). The
solid bars beneath the horizontal axis in this graph indicate residues where crossover points
occurred in actual functional mutants previously identified. Crameri et al. (1998).

In this exemplary model, the crossover probability P_c is related to fragment size and
is the same at each residue. A residue is picked at random and a random number is chosen
to determine whether a cut point is placed at that residue of the
sequence. This is repeated N times, where N is the number of residues. This effectively
fragments the parent sequences for modeling purposes. The cut points that do not correspond
to regions of adequate sequence identity are thrown out (FIG. 13) and the remaining
crossovers are used to create all the possible recombinant mutants. A probability distribution
for crossovers along the gene can be calculated by taking all the recombinant mutants
generated by the algorithm and keeping track of where the crossovers occur. The number
of times a crossover is observed is normalized by the total number of chimeric mutants
generated to obtain a probability in silico.
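
The normalization step amounts to counting, for each residue, how many mutants in the library have a crossover there and dividing by the library size. A minimal sketch follows; the representation of a mutant as a list of crossover positions, the function name `crossover_probabilities` and the four-member toy library are assumptions.

```python
from collections import Counter

def crossover_probabilities(library, n_residues):
    """Per-residue crossover probability for a library of in silico mutants.

    library: one list of crossover positions (residue indices) per mutant.
    Returns a list whose entry i is the fraction of mutants with a crossover at i.
    """
    counts = Counter()
    for crossovers in library:
        counts.update(set(crossovers))  # count each position once per mutant
    total = len(library)
    return [counts[i] / total for i in range(n_residues)]

# Toy library of four mutants over a 10-residue protein (made-up positions).
library = [[3, 7], [3], [], [7, 9]]
probs = crossover_probabilities(library, 10)
print(probs[3], probs[7], probs[9])  # prints "0.5 0.5 0.25"
```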

Calculating coupling interactions.

Using the high resolution crystal structure obtained for the cephalosporinase protein
expressed by Enterobacter cloacae (Lobkovsky, supra), coupling interactions were
identified for all pairwise combinations of amino acid side chains. Specifically, the coupling
interactions were delineated for each pair of amino acid residues, i and j, as a stability
parameter S(i,j), with stability corresponding to the calculated energies of hydrogen bonds,
electrostatic interactions and van der Waals interactions between side chains. To calculate
such pairwise interactions, the DREIDING force field (Mayo et al., J. Phys. Chem., 94:8897
(1990)) was used in a modified form as previously described (Dahiyat et al., Protein
Science, 6:1333 (1997)). Pairwise combinations of
residues exhibiting interaction energies whose absolute value is greater than 0.25 kcal/mol
were considered coupled for these examples.

Pair-wise interactions were also calculated using ORBIT protein design software that
included parameters for hydrogen bonds, electrostatic interactions and van der Waals
interactions between side chains.
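
Once pairwise energies are available from a force field, the energy-based definition of coupling reduces to a threshold test on their absolute values. The sketch below assumes the energies are supplied as a dictionary keyed by residue pair; the function name `coupled_pairs` and the energy values are invented for illustration, and only the 0.25 kcal/mol threshold comes from the text.

```python
THRESHOLD_KCAL_PER_MOL = 0.25  # coupling threshold stated in the text

def coupled_pairs(pair_energies, threshold=THRESHOLD_KCAL_PER_MOL):
    """Return the residue pairs whose interaction energy magnitude exceeds the threshold.

    pair_energies: dict mapping (i, j) residue pairs to energies in kcal/mol.
    """
    return {pair for pair, energy in pair_energies.items() if abs(energy) > threshold}

# Made-up energies: favorable (negative) and unfavorable (positive) interactions
# both count as coupling when their magnitude is large enough.
energies = {(10, 42): -1.30, (10, 11): 0.10, (57, 90): 0.26, (5, 6): -0.25}
coupled = coupled_pairs(energies)
print(sorted(coupled))  # prints [(10, 42), (57, 90)]
```

A pair whose energy magnitude equals the threshold exactly, such as (5, 6) here, is not counted, since the text requires the absolute value to be greater than 0.25 kcal/mol.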

Determining crossover disruption of mutants.

Having thus identified pair-wise interactions between residues of the parent
Enterobacter cloacae cephalosporinase proteins, each of the recombination mutants
generated in silico was examined to identify coupling interactions that were originally
present in the parent sequence(s), but had been disrupted in the crossover mutant. This
demonstrates that Methods 1 and 2 generate random pools of mutants in silico that model the
random pools generated in actual shuffling experiments. According to the invention, rules
based on coupling interactions are used, as described, to eliminate many of the randomly
generated candidates from consideration; the model focuses on optimal candidates
which are more likely to exhibit desirable properties.

The average crossover disruption for the pool of crossover mutants shown in FIG.
4A was E_c = 407. The average crossover disruption for the pool of crossover mutants shown
in FIG. 4C was E_c = 44.

Screening mutant libraries.

The in silico crossover mutants were separated into two logical "bins". The first bin
included all crossover mutants generated. The second bin contained those mutants that
exhibited a lower level of crossover disruption. In particular, the crossover mutants in the
second bin had a crossover disruption level that fell below a preselected threshold (E_thresh).
The subpools of FIGS. 4B and 4D represent those chimeras that have crossover disruptions
below the respective thresholds. The average crossover disruption is compared with a
threshold. For instance, the threshold of 75 (FIG. 4B) was applied to the larger pool where
the average disruption was 407 (FIG. 4A). The disruption threshold of 18 (FIG. 4D) was
taken from the larger pool where the average disruption was 44 (FIG. 4C). In the first case,
the smaller pool represents 1% of the larger pool and in the second case, the smaller pool
represents 7.5% of the larger pool. The differences in the average disruption of the two large
pools (407 versus 44) reflect several differences in the algorithm that was used to generate
them. The difference is primarily due to the fact that amino acids that are identical in all of
the parents were scored as disruptive when generating the crossover disruption of the
chimeric mutants in FIGS. 4A and 4B (P_i, P_j = 1.0 in Equation 2).

The distribution of crossover locations that produces minimum amounts of crossover
disruption is shown in FIGS. 4B and 4D. The figures plot the calculated probability P_c that
a crossover event occurred at a nucleotide corresponding to each amino acid residue of the
protein. The crossover probability is determined by counting the number of times a cut point
occurs at a certain residue in a pool. The pool is generated either by sequence identity alone
or by combining sequence identity with disruption. The number of times a cut point is
observed is divided by the total number of chimeras in the pool. For example, if P_c (or the
corresponding probability for the disruption-screened pool) of residue 25 is 0.02, this means
that 2% of all chimeric mutants had a crossover at residue 25. While the unscreened pool of
mutants has an even distribution of crossover points in areas of sequence identity between
the parental strands (FIGS. 4A and 4C), the screened pool has an uneven distribution of
crossover points that are concentrated at the sequence termini.
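
The counting procedure described above can be illustrated with a minimal sketch. The function name, the representation of a chimera as a set of 1-based cut-point residue numbers, and the toy pool are assumptions made for illustration only; they are not taken from the specification.

```python
from collections import Counter


def crossover_probability(pool, length):
    """Return, for residues 1..length, the fraction of chimeras cut at that residue.

    `pool` is a list of chimeras, each given as a collection of 1-based
    cut-point residue numbers (a hypothetical representation).
    """
    counts = Counter()
    for cut_points in pool:
        counts.update(set(cut_points))  # count a residue once per chimera
    total = len(pool)
    return [counts[r] / total if total else 0.0 for r in range(1, length + 1)]


# Toy pool of four chimeras (illustrative data only).
pool = [{25}, {40}, {25, 60}, {10}]
pc = crossover_probability(pool, length=60)
print(pc[24])  # probability of a crossover at residue 25
```

Applying the same function to the unscreened pool and to the screened subpool would give the two distributions being compared; how the pools themselves are built is outside this sketch.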

Comparison with Previous Experiments.

FIGS. 4A and 4C show the probability distribution for cut points in β-lactamase
calculated using DNA shuffling methods based upon sequence similarity. The variance in
the maximum probability at each crossover is caused by the number of parents that share
sequence identity at that point. The grey bars beneath the horizontal axis indicate actual
crossovers that were observed in prior experiments (Crameri et al., Nature, 391:288 (1998)).
The width of these bars is due to the inability to resolve the cut point to a single residue, due
to the sequence similarity between parents. FIGS. 4B and 4D show the probability
distribution when the additional constraint of low crossover disruption is imposed.



As can be readily determined from FIGS. 4A and 4C, the unscreened pools of
mutant hybrids show even distribution of crossover points in areas of sequence identity
between the parental strands. However, as can be determined from FIGS. 4B and 4D, once
the total pool is screened for mutants with minimal crossover disruption, the areas of parental
sequence homology do not exhibit an even distribution of crossover points. For the proteins
in these examples, computationally determined favorable crossover locations are found
mainly at the termini of the sequences. These findings paralleled empirical observations
(e.g., Crameri, supra). The few crossover locations not located at the termini of the sequence
correlated well with data from experiments by Stemmer. See, Stemmer, Proc. Natl. Acad.
Sci. U.S.A., 91:10747 (1994); and Stemmer, Nature, 370:389 (1994).

The crossover probability distribution for the family shuffling experiment, in silico,
starting with Citrobacter freundii, Klebsiella pneumoniae, Enterobacter cloacae, and
Yersinia enterocolitica also corresponded well to previous in vivo experimental data where
Yersinia enterocolitica was not observed in a pool of mutant offspring. Crameri, supra.
For example, FIG. 6 shows the crossover probability distribution for the family
shuffling experiment with and without Yersinia enterocolitica as a starting parent. The
dashed gray line represents the probability distribution for combinations of Citrobacter
freundii, Klebsiella pneumoniae, Enterobacter cloacae. The solid black line is for the
mutants containing the Yersinia enterocolitica sequence. The inclusion of Yersinia
enterocolitica in the set of starting parents leads to the creation of a pool of recombinant
mutant offspring with an increased probability of greater disruption of coupling interactions
than the pool of recombinant mutant offspring without Yersinia enterocolitica. The
generation of crossover disruption profiles, also called schema profiles, provides a
mechanism by which the optimal parents can be determined.

Determination of Optimal Parents Based on an In Silico Chimera Library

Given the explosive growth of the gene databases due to the exhaustive sequencing
of large numbers of organisms, sequences of homologous genes are easily accessible.
Currently, the choice of starting parents for family shuffling is arbitrary or is made with
minimal information (e.g., availability, sequence similarity). To date there is no known
method to computationally use the information in the sequence databases to identify optimal
starting parents. For example, in the beta-lactamase experiment discussed above, Stemmer
and co-workers shuffled four genes, but only three of these genes were found in the improved
recombinants (Crameri et al., 1998). The Yersinia enterocolitica gene (parent 3 - FIG. 27)
was not observed in the top mutants.

To understand this effect, the methods of the invention were applied to calculate all
of the possible recombination locations between parents: (1) Enterobacter cloacae, (2)
Citrobacter freundii, (3) Yersinia enterocolitica, and (4) Klebsiella pneumoniae.

A potential crossover was recorded for each region shared between two parents that had six
nucleotides in common. For instance, the total number of potential crossovers for parent 1
is the sum of the number of potential crossovers for parents 1-2, 1-3, and 1-4. The
differences in the total number of crossovers for each parent reflect the sequence identity
shared between parents. Because parent 3 shares more sequence identity with parents 1 and
2 than parent 4 does with parents 1 and 2, the total number of potential crossovers is greater.
However, when the additional constraint of having a low schema disruption is imposed,
parent 4 has more potential crossovers than parent 3. This provides a mechanism by which
parent 4 was observed in the improved chimeras, but not parent 3.
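
The per-parent tally of potential crossovers can be sketched as follows. This is a minimal illustration under stated assumptions: the parents are taken to be pre-aligned strings of equal length, a potential crossover is counted once per maximal run of identical aligned nucleotides of at least `min_run` positions (the run length is left as a parameter), and the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def identity_runs(a, b, min_run):
    """Count maximal runs of identical aligned positions of length >= min_run."""
    runs, current = 0, 0
    for x, y in zip(a, b):
        if x == y and x != "-":  # gaps never count as identity
            current += 1
        else:
            if current >= min_run:
                runs += 1
            current = 0
    if current >= min_run:  # close a run that reaches the end of the alignment
        runs += 1
    return runs


def potential_crossovers(parents, min_run):
    """Total potential crossovers per parent, summed over every pair it belongs to."""
    totals = {name: 0 for name in parents}
    for p, q in combinations(parents, 2):
        n = identity_runs(parents[p], parents[q], min_run)
        totals[p] += n
        totals[q] += n
    return totals


# Toy aligned sequences (illustrative only, not the beta-lactamase genes).
parents = {
    "parent1": "ATGGCATTTGACCGT",
    "parent2": "ATGGCTTTTGACCGA",
    "parent3": "ATGACATTTGACGGT",
}
totals = potential_crossovers(parents, min_run=4)
print(totals)
```

Repeating the tally after discarding crossovers whose disruption exceeds a chosen threshold would give the second, disruption-constrained count discussed in the text.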

6.3 Single Cut Point Recombination

The invention can also be applied to sets of parent biopolymers that do not share
sequence identity, using recombination methods that do not rely on sequence identity.
At least two methods are known in the art for producing recombinant gene libraries having
cross-overs at any position, regardless of sequence identity. See, e.g., Ostermeier et al.,
Nature Biotechnology, 17:1205-1209 (1999); Ostermeier et al., Bioorg. Med. Chem.,
7:2139-2144 (1999); Sieber et al., Nature Biotechnology, 19:456-460 (2000). In particular,
these methods allow genes (and their corresponding polypeptides) that have divergent
nucleotide sequences to be recombined. However, in the experimental implementation
described here only two parent sequences are recombined with only a single cut point
(crossover).

A method of the invention was used to simulate the recombination of PurN and
GART glycinamide ribonucleotide transformylase (Ostermeier et al., Nature Biotechnology,
17:1205-1209 (1999)). A coupling matrix was calculated using the three-dimensional
structure of PurN previously described by Almassy et al. (Proc. Natl. Acad. Sci. U.S.A.,
89:6114 (1992)). The crossover disruption was then calculated for each possible single
crossover mutant. FIG. 5 provides a plot showing the crossover disruption calculated for
each mutant, indicated by the amino acid residue of the crossover location.
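
A single-crossover disruption profile of this kind can be sketched in a few lines. The sketch assumes two pre-aligned parents of equal length and a list of coupled residue pairs given as 0-based index pairs; it counts a coupled pair as disrupted when the chimera's pair of residue identities occurs in neither parent, which is one reading of the disruption rule stated in the claims. The names and toy data are hypothetical.

```python
def crossover_disruption(chimera, parents, coupled_pairs):
    """Count coupled residue pairs whose identities in the chimera match no parent."""
    disrupted = 0
    for i, j in coupled_pairs:
        pair = (chimera[i], chimera[j])
        if all((p[i], p[j]) != pair for p in parents):
            disrupted += 1
    return disrupted


def single_crossover_profile(parent_a, parent_b, coupled_pairs):
    """Disruption for each cut point k: residues [0, k) from A, [k, n) from B."""
    parents = (parent_a, parent_b)
    profile = {}
    for k in range(1, len(parent_a)):
        chimera = parent_a[:k] + parent_b[k:]
        profile[k] = crossover_disruption(chimera, parents, coupled_pairs)
    return profile


# Toy aligned sequences and couplings (illustrative only, not PurN or GART).
parent_a = "MKVLAAGI"
parent_b = "MRVLSTGV"
coupled_pairs = [(1, 4), (1, 5), (4, 5), (5, 7), (4, 7)]
profile = single_crossover_profile(parent_a, parent_b, coupled_pairs)
print(profile)
```

Plotting `profile` against the cut-point residue would give a curve of the same kind as the one described for FIG. 5, although the toy values carry no biological meaning.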

The range of amino acid sequences shown in FIG. 5 (i.e., amino acid residues 50-150
of the aligned glycinamide ribonucleotide transformylase proteins) corresponds to crossover
regions where non-homologous recombinations were previously constructed by Benkovic
et al., Nature Biotechnology, 17:1205-1209 (1999). Crossover locations for functional
crossover mutants that were identified in these previous experiments are indicated on the
graph in FIG. 5 by horizontal lines. The vertical lines show the positions where single
crossovers occurred and led to functional enzymes. The "2" indicates that this crossover was
sampled twice in the library. The diamonds show where homologous recombination (DNA
shuffling with single cut points) experiments produced crossovers. The calculated crossover
disruption decreases rapidly outside of the 50-150 amino acid sequence region, indicating that,
as expected, crossovers would be strongly biased towards the N- and C-termini of the parents.
Local crossover disruption minima are also present in the region between amino acid
residues 50-150 shown in FIG. 5. These minima reflect the fact that glycinamide
ribonucleotide transformylase proteins comprise at least two topologically separate domains.
Thus, the local minima in crossover disruption reflect crossover points which occur at the
intersection of such separate domains.

It will be appreciated by persons of ordinary skill in the art that the above implementations
are illustrative only, and do not limit the scope of the invention or the accompanying claims.

WE CLAIM:

1. A method for selecting a crossover location in a first biopolymer having a first
polymer sequence, for recombination with one or more second biopolymers each having its
own second polymer sequence, which method comprises:

identifying coupling interactions between pairs of residues in the first polymer
sequence;

generating a plurality of data structures, each data structure representing a crossover
mutant comprising a recombination of the first and a second polymer sequence wherein each
recombination has a different crossover location;

determining, for each data structure, a crossover disruption related to the number of
coupling interactions disrupted in the crossover mutant represented by the data structure; and

identifying, among the plurality of data structures, a particular data structure having
a crossover disruption below a threshold,

wherein the crossover location of the crossover mutant represented by the particular
data structure is the identified crossover location.

2. A method of claim 1, wherein the particular polymer sequence comprises a sequence
of amino acid residues.

3. A method of claim 1, wherein the particular polymer sequence comprises a sequence
of nucleotide residues.

4. A method of claim 1, wherein coupling interactions are identified by use of a
coupling matrix.

5. A method of claim 1, wherein the coupling matrix is the summation of all the
coupling interactions of the first polymer sequence.



6. A method of claim 1, wherein coupling interactions are identified by a determination
of a conformational energy between residues.

7. A method of claim 1, wherein coupling interactions are identified by a determination
of interatomic distances between residues.

8. A method of claim 6, wherein conformational energies for each of the first and
second polymer sequences are determined from a three-dimensional structure for at least one
of the first and second polymer sequences.

9. A method of claim 7, wherein interatomic distances for each of the first and second
polymer sequences are determined from a three-dimensional structure for at least one of the
first and second polymer sequences.
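
By way of illustration only, a distance-based coupling matrix of the kind recited above might be sketched as follows. The sketch assumes one representative coordinate per residue and a hypothetical cutoff of 4.5 distance units; neither the cutoff nor the function name comes from the specification, and a real structure would supply many atoms per residue.

```python
import math


def coupling_matrix(coords, cutoff=4.5):
    """Return a symmetric 0/1 matrix with 1 where two residues lie within `cutoff`."""
    n = len(coords)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                matrix[i][j] = matrix[j][i] = 1
    return matrix


# Four made-up residue positions (illustrative coordinates only).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.0, 0.0)]
matrix = coupling_matrix(coords)
for row in matrix:
    print(row)
```

An energy-based variant would replace the distance test with a comparison of a pairwise conformational energy against a threshold, leaving the rest of the sketch unchanged.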

10. A method of claim 2, wherein coupling interactions are identified by a
conformational energy between residues above a threshold.

11. A method of claim 1, wherein a coupling interaction between a pair of residues in the
first polymer sequence is disrupted in a crossover mutant wherein a coupling interaction
between a pair of residues is disrupted in a crossover mutant if the identity of both residues
participating in the coupling interaction is different than that which exists in any of the
parents.

12. A method of claim 5, wherein a coupling interaction between a pair of residues in the
first polymer sequence is disrupted in a crossover mutant wherein a coupling interaction
between a pair of residues is disrupted in a crossover mutant if the identity of both residues
participating in the coupling interaction is different than that which exists in any of the
parents.

13. A method of claim 1, wherein the crossover disruption is the summation of all
coupled interactions in the parent that are considered disrupted in the data structure
representing the crossover mutant.

14. A method of claim 1, wherein the threshold is an average level of crossover
disruption for the plurality of data structures.

15. A method of claim 1, wherein the threshold is at least one standard deviation below
the average level for the plurality of data structures.

16. A method of claim 1, wherein the threshold is set so that approximately 7.5% of the
total number of generated data structures is below the threshold.

17. A method of claim 1, wherein the threshold is set so that approximately 1% of the
total number of generated data structures is below the threshold.

18. A method of claim 1, wherein the threshold is set so that approximately 0.001% of
the total number of generated data structures is below the threshold.
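
For illustration, the alternative ways of setting the threshold recited above (the average, one standard deviation below the average, or a cutoff that admits a fixed fraction of the pool) can be sketched as follows. The function names, the use of the population standard deviation, and the toy disruption values are assumptions of this sketch.

```python
import statistics


def mean_threshold(disruptions):
    """Threshold equal to the average crossover disruption of the pool."""
    return statistics.mean(disruptions)


def one_sigma_threshold(disruptions):
    """Threshold one (population) standard deviation below the average."""
    return statistics.mean(disruptions) - statistics.pstdev(disruptions)


def below_threshold(disruptions, threshold):
    """Indices of the data structures whose disruption falls below the threshold."""
    return [i for i, d in enumerate(disruptions) if d < threshold]


def lowest_fraction(disruptions, fraction):
    """Indices of roughly `fraction` of the pool having the lowest disruption."""
    k = max(1, round(fraction * len(disruptions)))
    order = sorted(range(len(disruptions)), key=lambda i: disruptions[i])
    return order[:k]


# Toy disruption values for a pool of ten chimeras (illustrative only).
disruptions = [40, 12, 75, 33, 18, 90, 51, 27, 64, 46]
print(below_threshold(disruptions, mean_threshold(disruptions)))
print(lowest_fraction(disruptions, 0.2))
```

With a realistic pool, `lowest_fraction(disruptions, 0.075)` or `lowest_fraction(disruptions, 0.01)` would select subpools of the relative sizes mentioned in the claims.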

19. A method of claim 1, wherein the generation of crossover mutants comprises:

the sequence alignment of a plurality of biopolymers;

the identification of possible cut points in the biopolymer based upon regions of
sequence identity identified by the sequence alignment; and

the generation of single crossover mutants based upon the identified possible cut
points.

20. A method of claim 19, wherein the regions of sequence identity must contain at least
4 residues.



21. A method of claim 19, wherein there must be at least eight residues between crossovers.

22. A method of claim 1, wherein the generation of the plurality of data structures
comprises:

the sequence alignment of a plurality of biopolymers using simulated annealing with
non-homologous parents;

selecting crossover locations based upon the minimization of crossover disruption,
fragment size, starting number of parents; and

the generation of a plurality of data structures based upon the identified possible
crossover locations.

23. A method of claim 1, wherein the generation of the plurality of data structures
comprises:

choosing one of the biopolymers from the plurality of biopolymers at random;

copying the biopolymer until a possible crossover location is reached;

choosing a random number between 0 and 1;

choosing a new biopolymer from the plurality of biopolymers to copy to the offspring
if the random number is below a crossover probability; and

repeating the above process until the data structure representing the crossover mutant
is the desired length.
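
The copy-and-switch procedure recited in the preceding claim can be sketched as follows. This is an illustrative sketch only: it assumes pre-aligned parents of equal length, takes the possible crossover locations as a set of 0-based positions, and always switches to a different parent when a switch occurs. The function and parameter names are hypothetical.

```python
import random


def generate_chimera(parents, crossover_sites, p_crossover, rng=random):
    """Copy a randomly chosen parent, switching parents at allowed sites by chance."""
    current = rng.randrange(len(parents))
    offspring = []
    for pos in range(len(parents[0])):
        # At a possible crossover location, draw a number in [0, 1) and
        # switch to another parent if it falls below the crossover probability.
        if pos in crossover_sites and rng.random() < p_crossover:
            others = [i for i in range(len(parents)) if i != current]
            if others:
                current = rng.choice(others)
        offspring.append(parents[current][pos])
    return "".join(offspring)


# Toy parents and crossover sites (illustrative only).
parents = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
sites = {3, 6}
child = generate_chimera(parents, sites, p_crossover=0.5, rng=random.Random(0))
print(child)
```

Calling the function repeatedly builds a pool of data structures; passing a seeded `random.Random` instance, as in the example, makes a run reproducible.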

24. A method of claim 19, wherein the generation of the plurality of data structures based
upon identified cut points comprises:

cutting the biopolymers into biopolymer fragments by randomly assigning cut
points with a set probability;

randomly choosing one of the biopolymer fragments as a starting parent;

randomly identifying another biopolymer fragment from the total pool of the
biopolymer fragments;

ligating the identified biopolymer fragment to the parent fragment, if the identified
fragment has a sequence identity cut-point at the end of the fragment; and

repeating the randomly identifying step until the data structure representing the
crossover mutant is the desired length.

25. A method for directed evolution of a polymer, which method comprises steps of:

providing a plurality of parent polymer sequences;

identifying crossover locations in the parent polymer sequences for recombination
according to claim 1;

generating one or more mutant polymer sequences utilizing recombinatory techniques
targeted at the identified crossover locations on the parent polymer sequences;

screening the one or more mutant sequences for the one or more properties of
interest; and

selecting at least one mutant sequence where one or more properties of interest are
identified.

26. A method according to claim 25, wherein the method is iteratively repeated, and
wherein at least one mutant sequence selected in a first iteration is a parent sequence in a
second iteration.

27. A method of claim 25, wherein the recombination techniques are selected from the
group consisting of: DNA shuffling, StEP method, fragmentation and reassembly, synthesis
and random-priming recombination.

28. A computer system for analyzing a polymer sequence, which computer system
comprises:

memory and a processor interconnected with the memory and having one or more
software components loaded therein, wherein the one or more software components cause
the processor to execute steps of a method according to claim 1.

29. A computer system of claim 28, wherein the software components comprise a 
database of polymer sequences. 

30. A computer system of claim 28, wherein the software components comprise a
database of three-dimensional structures for polymer sequences.

31. A computer program comprising a computer readable medium having one or more
software components encoded in computer readable form, wherein the one or more software
components may be loaded into a memory of a computer system and cause a processor
interconnected with the memory to execute steps of a method according to claim 1.

32. A computer program according to claim 30, wherein the computer readable medium
further has, encoded thereon in computer readable form, a database of polymer sequences.

33. A computer program according to claim 30, wherein the computer readable medium
further has, encoded thereon in computer readable form, a database of three-dimensional
structures for polymer sequences.

34. A computer system for analyzing a polymer sequence, which computer system
comprises:

memory and a processor interconnected with the memory and having one or more
software components loaded therein, wherein the one or more software components cause
the processor to execute steps of a method according to claim 19.



35. A computer program comprising a computer readable medium having one or more
software components encoded in computer readable form, wherein the one or more software
components may be loaded into a memory of a computer system and cause a processor
interconnected with the memory to execute steps of a method according to claim 19.

36. A computer system for analyzing a polymer sequence, which computer system
comprises:

memory and a processor interconnected with the memory and having one or more
software components loaded therein, wherein the one or more software components cause
the processor to execute steps of a method according to claim 23.

37. A computer program comprising a computer readable medium having one or more
software components encoded in computer readable form, wherein the one or more software
components may be loaded into a memory of a computer system and cause a processor
interconnected with the memory to execute steps of a method according to claim 23.

38. A computer system for analyzing a polymer sequence, which computer system
comprises:

memory and a processor interconnected with the memory and having one or more
software components loaded therein, wherein the one or more software components cause
the processor to execute steps of a method according to claim 24.

39. A computer program comprising a computer readable medium having one or more
software components encoded in computer readable form, wherein the one or more software
components may be loaded into a memory of a computer system and cause a processor
interconnected with the memory to execute steps of a method according to claim 24.



40. A computer system for analyzing a polymer sequence, which computer system
comprises:

memory and a processor interconnected with the memory and having one or more
software components loaded therein, wherein the one or more software components cause
the processor to execute steps of a method according to claim 25.

41. A method for producing hybrid polymers from two or more parent polymers
comprising the steps of:

identifying structural domains of at least one parent polymer;

organizing identified domains into schema;

calculating a schema disruption profile;

selecting at least one crossover location based on the schema disruption profile; and

recombining two or more parent polymers at one or more selected crossover locations
to produce at least one hybrid polymer.

42. A method of claim 41, wherein parent polymers are recombined in silico, in vitro, in
vivo, or in any combination thereof.

43. A method of claim 41, wherein parent polymers are recombined in silico to produce
at least one candidate hybrid polymer.

44. A method of claim 43, wherein parent polymers are physically recombined at one or
more crossover locations, including at least one selected crossover location, to produce at
least one hybrid polymer corresponding to a candidate hybrid polymer.

45. A method of claim 44, wherein parent polymers are physically recombined in vitro.

46. A method of claim 44, wherein parent polymers are physically recombined in vivo.



47. A method of claim 41, wherein each parent polymer comprises a polypeptide.

48. A method of claim 44, wherein each parent polymer comprises a polypeptide.

49. A method of claim 41, wherein each parent polymer comprises an oligonucleotide.

50. A method of claim 44, wherein each parent polymer comprises an oligonucleotide.

51. A method of claim 44, wherein the parent polymers are one of polypeptides and
oligonucleotides, and wherein parent polymers are recombined in a directed evolution
experiment.

52. A method of claim 44, comprising the step of screening hybrid polymers for one or
more properties.

53. A method of claim 51, comprising the step of screening hybrid polymers for one or
more properties.

54. A method of claim 51, wherein the directed evolution experiment includes at least
one protocol selected from the group consisting of fragmentation and reassembly, family
shuffling, exon shuffling, StEP, ITCHY, synthesis techniques, and PCR-based techniques.

55. A method of claim 44, wherein hybrid polymers are expressed by host cells.

56. A method of claim 41, wherein hybrid polymers are expressed by host cells.

57. A method of claim 53, wherein hybrid polymers are expressed by host cells.



58. A method of claim 41, wherein crossover locations are selected from a schema
disruption profile based on a prediction that the selected crossovers will tend to produce
relatively less schema disruption than other crossover locations.

59. A method of claim 44, wherein crossover locations are selected from a schema
disruption profile based on a prediction that the selected crossovers will tend to produce
relatively less schema disruption than other crossover locations.

60. A method of claim 41, wherein crossover locations are selected based on a schema
disruption threshold.

61. A method of claim 44, wherein crossover locations are selected based on a schema
disruption threshold.

62. A method of claim 51, wherein crossover locations are selected based on a schema
disruption threshold.

63. A method of claim 41, wherein crossover locations are selected to preserve schema
from at least one parent polymer.

64. A method of claim 41, wherein crossover locations are selected to preserve schema
from a plurality of parent polymers.

65. A method of claim 44, wherein crossover locations are selected to preserve schema
from at least one parent polymer.

66. A method of claim 44, wherein crossover locations are selected to preserve schema
from a plurality of parent polymers.



67. A method of claim 51, wherein crossover locations are selected to preserve schema
from at least one parent polymer.

68. A method of claim 51, wherein crossover locations are selected to preserve schema
from a plurality of parent polymers.

69. A method of claim 44, wherein a library of candidate hybrid polymers is compared
with a library of physically recombined hybrid polymers.

70. A method of claim 51, wherein the sequence space of a directed evolution experiment
is reduced based on a library of in silico candidate hybrid candidate sequences.

71. A method for producing a library of hybrid polymers comprising the steps of:

choosing two or more parent polymers;

identifying structural domains of at least one parent polymer;

organizing identified domains into schema;

calculating a schema disruption profile;

selecting crossover locations based on the schema disruption profile;

recombining two or more parent polymers at one or more selected crossover locations
to produce a set of hybrid polymers;

repeating at least the choosing and recombining steps to produce at least one
additional set of hybrid polymers; and

generating a library of hybrid polymers from the sets of hybrid polymers.

72. A method of claim 71, wherein the repeated steps comprise choosing at least one
hybrid polymer as a parent polymer.



73. A method of claim 71, wherein recombining steps are performed in silico.

74. A method of claim 73, further comprising physically recombining parent polymers
at selected crossover locations to produce hybrids in the library.

75. A method of claim 71, wherein schema are common to at least two parents.

76. A method of claim 74, wherein schema are common to at least two parents.

77. A method of claim 71, wherein a schema disruption profile is calculated based on one
or both of conformational energy and interatomic distances.

78. A method of claim 75, wherein a schema disruption profile is calculated based on one
or both of conformational energy and interatomic distances.

79. A method of claim 73, wherein parent polymers are physically recombined in a
directed evolution experiment.

80. A method of claim 79, wherein the directed evolution experiment includes at least
one protocol selected from the group consisting of fragmentation and reassembly, family
shuffling, exon shuffling, StEP, ITCHY, synthesis techniques, and PCR-based techniques.

81. A method of claim 74, further comprising screening hybrids in the library for one or
more properties.

82. A method of claim 73, further comprising physically recombining parent polymers
at selected crossover locations to produce hybrids in the library and screening hybrids in the
library for one or more properties; and wherein the repeated steps comprise choosing at least
one hybrid polymer as a parent polymer based on screening results.



83. A method of claim 41, wherein schema comprise domains identified according to
sequence alignments between two or more parent polymers.

84. A method of claim 71, wherein schema comprise domains identified according to
sequence alignments between two or more parent polymers.

85. A method of claim 41, wherein the crossover location comprises a crossover region.

86. A method of claim 71, wherein the crossover location comprises a crossover region.

87. A method of claim 41, wherein the schema disruption profile comprises fitness
contributions of polymer residues of one or more parent polymers.

88. A method of claim 71, wherein the schema disruption profile comprises fitness
contributions of polymer residues of one or more parent polymers.

89. A method of claim 41, further comprising the step of calculating a crossover disruption
profile.

90. A method of claim 71, further comprising the step of calculating a crossover disruption
profile.

91. A method of claim 41, further comprising restricting the selection of crossover locations
based on at least one predetermined constraint.

92. A method of claim 91, wherein the predetermined constraint is based on a protocol for
physically recombining the polymers.



93. A method of claim 92, wherein the predetermined constraint comprises at least one of
a requirement of sequence identity between parents, a constraint on the number of
crossovers, and a constraint on the location of crossovers.

94. A method of claim 41, further comprising the steps of generating a coupling matrix and
using the matrix in at least one of the identifying, organizing, calculating, and selecting steps.

95. A method of claim 71, further comprising the steps of generating a coupling matrix and
using the matrix in at least one of the identifying, organizing, calculating, and selecting steps.

96. A method of claim 41, wherein domains are identified based on sequence information
for at least one parent polymer.

97. A method of claim 71, wherein domains are identified based on sequence information
for at least one parent polymer.

98. A method of claim 41, wherein domains are identified based on a crystal structure for
at least one parent polymer.

99. A method of claim 71, wherein domains are identified based on a crystal structure for
at least one parent polymer.

100. A method of claim 71, wherein crossover locations are selected from a schema
disruption profile based on a threshold disruption value.

101. A method for modeling the recombination of two or more parent polymers comprising
the steps of:

obtaining structural information for at least one parent polymer;

evaluating coupling interactions between polymer residues based on the structural
information;

identifying domains based on the determined coupling interactions;

calculating the crossover disruption of the identified domains to produce a disruption
profile;

applying a predetermined threshold disruption to each domain of the disruption
profile;

at least one of, accepting domains which satisfy the threshold and rejecting domains
which do not satisfy the threshold;

repeating at least the identifying, calculating and applying steps until each identified
domain is accepted or rejected;

designating the accepted or rejected domains as disruptive;

selecting crossover regions from domains that are not designated as disruptive; and

recombining parent polymers at selected crossover regions.

102. A method of claim 101, wherein the step of identifying domains comprises specifying
the polymer residues which belong to each domain, and the step of selecting crossover
regions comprises specifying one or more residues within at least one non-disruptive domain.

103. A method of claim 101, wherein the threshold disruption represents a maximum
allowable disruption, domains having a disruption above the threshold are accepted as
disruptive and are preserved, domains having a disruption below the threshold are rejected
as non-disruptive and may be altered, and crossover regions are selected from residues
belonging to non-disruptive domains.
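
For illustration only, the accept-or-reject step recited above can be sketched as follows. The sketch assumes each domain is given as a list of residue numbers with a precomputed disruption value; treating a disruption exactly equal to the threshold as non-disruptive is a choice made for this sketch, and all names and values are hypothetical.

```python
def select_crossover_residues(domains, disruption, threshold):
    """Split domains into preserved (disruptive) ones and crossover candidates.

    `domains` maps a domain name to its residue numbers; `disruption` maps
    the same names to a crossover disruption value.
    """
    preserved, candidates = [], []
    for name, residues in domains.items():
        if disruption[name] > threshold:
            preserved.append(name)  # too disruptive to cut: keep intact
        else:
            candidates.extend(residues)  # crossovers may be placed here
    return preserved, sorted(candidates)


# Toy domains and disruption values (illustrative only).
domains = {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]}
disruption = {"A": 30, "B": 5, "C": 12}
preserved, candidates = select_crossover_residues(domains, disruption, threshold=12)
print(preserved, candidates)
```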

104. A method of claim 103, wherein domains having a disruption equal to the threshold are
one of accepted as disruptive or rejected as non-disruptive.



105. A method of claim 102, wherein the selection of crossover regions is restricted
according to one or more recombination constraints.

106. A method of claim 104, wherein the selection of crossover regions is restricted
according to one or more recombination constraints.

107. A method of claim 105, wherein the constraint comprises at least one of a requirement
of sequence identity between parents, a constraint on the number of crossovers, and a
constraint on the location of crossovers.

108. A method of claim 106, wherein the constraint comprises at least one of a requirement
of sequence identity between parents, a constraint on the number of crossovers, and a
constraint on the location of crossovers.

109. A method of claim 105, wherein the constraint comprises a requirement of sequence
identity between parents, and the method further comprises:

obtaining sequence information for the parent polymers;

aligning the obtained sequence information; and

identifying cut points within aligned regions of the parent sequences.

110. A method of claim 109, wherein the step of identifying cut points comprises selecting cut
points having a relatively low crossover disruption, and the step of specifying a set of
parental fragments for recombination based on selected cut points.
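
As an illustration of identifying cut points within aligned regions (claim 109), the following sketch scans two pre-aligned parent sequences for runs of identical, gap-free columns. The window length of 4, the gap character "-" and the example sequences are assumptions made for this example; the claims do not specify them, and the low-disruption selection of claim 110 is not modeled here.

```python
def identity_cut_points(aligned_a: str, aligned_b: str, window: int = 4) -> list:
    """Return 0-based alignment columns c such that the `window` columns
    starting at c are identical and gap-free in both parents."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # True where both parents carry the same residue and neither has a gap.
    match = [a == b and a != "-" for a, b in zip(aligned_a, aligned_b)]
    return [c for c in range(len(match) - window + 1)
            if all(match[c:c + window])]

# Hypothetical aligned parents (one mismatch at column 5, one gap at column 10).
parent_a = "ATGGCCAAGT-TCA"
parent_b = "ATGGCGAAGTCTCA"
points = identity_cut_points(parent_a, parent_b, window=4)
print(points)  # [0, 1, 6]
```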

111. A method of claim 44, wherein parental polymers are genes, and the polymers are
physically recombined by a staggered extension process (StEP) comprising the steps of:

specifying one or more selected crossover locations;






cutting each of two or more parent polymers within one or more crossover regions
that each encompass one or more specified crossover locations to define a set of polymer
fragments;

producing a set of defined polymer fragments, wherein each fragment has an end
primer comprising a sequence with residues that extend past a specified crossover location;
and

assembling at least one pair of fragments having sequences which overlap an end
primer of at least one fragment of the pair, to produce a recombinant polymer.

112. A method of claim 111, wherein the producing step comprises synthesizing two or
more fragments.

113. A method of claim 112, wherein synthesizing fragments comprises split pool synthesis.

114. A method of claim 111, wherein fragments are assembled by extension from an end
primer.

115. A method of claim 111, wherein the set of defined polymer fragments comprises all of
the fragments arising from cutting all of the parent polymers within all of the crossover
regions that encompass all of the specified crossover locations.

116. A method of claim 115, wherein all of the fragments are assembled in all of the
possible combinations.
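
The exhaustive assembly of claims 115 and 116 can be pictured computationally as a Cartesian product over fragment positions. The sketch below is an in silico illustration only, assuming aligned parents of equal length and cut points given as string indices; the function names and the toy parents are hypothetical.

```python
from itertools import product

def cut_into_fragments(seq: str, cut_points) -> list:
    """Cut one sequence at every cut point, returning the ordered fragments."""
    bounds = [0, *sorted(cut_points), len(seq)]
    return [seq[s:e] for s, e in zip(bounds, bounds[1:])]

def all_recombinants(parents, cut_points) -> list:
    """Assemble every combination of fragments, one fragment per position."""
    if len({len(p) for p in parents}) != 1:
        raise ValueError("this sketch assumes aligned parents of equal length")
    per_parent = [cut_into_fragments(p, cut_points) for p in parents]
    # slots[k] holds fragment k taken from each parent in turn.
    slots = list(zip(*per_parent))
    return ["".join(choice) for choice in product(*slots)]

# Two toy parents and two crossover locations give 2**3 = 8 sequences,
# including the two unrecombined parents.
parents = ["AAAAAAAAA", "BBBBBBBBB"]
library = all_recombinants(parents, cut_points=[3, 6])
```

For P parents and k cut points the library holds P**(k+1) sequences, which is why the number of crossover locations is a practical constraint on library size.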

117. A method of claim 111, further comprising the step of screening one or more
recombinant polymers for a property.






118. A method of claim 74, wherein parental polymers are genes, and the polymers are
physically recombined by a staggered extension process (StEP) comprising the steps of:

specifying one or more selected crossover locations;

cutting each of two or more parent polymers within one or more crossover regions
that each encompass one or more specified crossover locations to define a set of polymer
fragments;

producing a set of defined polymer fragments, wherein each fragment has an end
primer comprising a sequence with residues that extend past a specified crossover location;
and

assembling at least one pair of fragments having sequences which overlap an end
primer of at least one fragment of the pair, to produce a recombinant polymer.

119. A method of claim 118, wherein fragments are assembled by extension from an end
primer.

120. A method of claim 111, wherein the set of defined polymer fragments comprises all of
the fragments arising from cutting all of the parent polymers within all of the crossover
regions that encompass all of the specified crossover locations; and wherein all of the
fragments are assembled in all of the possible combinations.

121. A method of claim 44, wherein the parental polymers are genes, and the polymers are
physically recombined by an in vitro-in vivo recombination method comprising the steps of:

shuffling at least two parent polymers to produce a set of parental fragments having
selected crossover locations;

assembling fragments at crossover locations by overlap extension and gap repair, to
provide double stranded sequences containing mismatched regions; and

repairing the mismatched regions in vivo by inserting the double-stranded sequences
into a host cell to provide a library of crossover recombinants.






122. A method of claim 121, wherein the double stranded sequences are inserted into a host
cell in the form of a heteroduplex plasmid.

123. A method of claim 121, wherein parental homoduplexes are removed. 

124. A method of claim 44, wherein the parental polymers are genes, and the polymers are
physically recombined by an in vitro-in vivo recombination method comprising the steps of:

specifying one or more selected cut points;

preparing synthetic polymer fragments having sequences corresponding to the
sequences of parent polymers that are cut at specified cut points;

extending the sequence of each fragment at a cut point against a parental template to
produce a set of polymer duplexes representing different combinations of fragments;

removing parent homoduplex polymers; and

providing a set of recombinants from the resulting heteroduplex polymers.

125. A method of claim 124, wherein parent homoduplexes are removed by inserting the
polymer duplexes into a host cell.

126. A method of claim 125, wherein the polymer duplexes are inserted into a host cell
in the form of a heteroduplex plasmid.

127. A method of claim 74, wherein the parental polymers are genes, and the polymers are
physically recombined by an in vitro-in vivo recombination method comprising the steps of:

specifying one or more selected cut points;

providing polymer fragments having sequences corresponding to the sequences of
parent polymers that are cut at specified cut points;

extending the sequence of each fragment at a cut point against a parental template to
produce a set of polymer duplexes representing different combinations of fragments;






removing parent homoduplex polymers; and

providing a set of recombinants from the resulting heteroduplex polymers.

128. A method of claim 127, wherein parent homoduplexes are removed by inserting the
polymer duplexes into a host cell.

129. A method of claim 128, wherein the polymer duplexes are inserted into a host cell
in the form of a heteroduplex plasmid.

130. A method of claim 127, further comprising the step of screening one or more
recombinant polymers for a property.

131. A method of claim 44, wherein the parental polymers are genes, and the polymers are
physically recombined by a PCR amplification method comprising the steps of:

specifying one or more selected cut points;

defining polymer fragments having sequences corresponding to the sequences of
parent polymers that are cut at specified cut points;

providing sets of primers, wherein each primer in a set hybridizes to all parent strands
at a crossover region corresponding to a specified cut point;

producing a set of defined fragments from each parent polymer by PCR amplification
with each set of primers; and

assembling fragments in a pool by PCR amplification.

132. A method of claim 131, wherein:

each set of primers is a pair of terminal primers or a pair of intervening primers;

each primer in a terminal pair of primers corresponds to at least one terminal end of
one parent polymer; and

each primer in each intervening pair of primers corresponds to a specified cut point.






133. A method of claim 132, wherein PCR amplification is performed using a first primer
selected from a first pair of primers, and a second primer selected from a second pair of primers.

134. A method of claim 133, wherein the first and second primers flank the ends of a
polymer fragment.

135. A method of claim 74, wherein the parental polymers are genes, and the polymers are
physically recombined by a PCR amplification method comprising the steps of:

specifying one or more selected cut points;

defining polymer fragments having sequences corresponding to the sequences of
parent polymers that are cut at specified cut points;

providing sets of primers, wherein each primer in a set hybridizes to all parent strands
at a crossover region corresponding to a specified cut point;

producing a set of defined fragments from each parent polymer by PCR amplification
with each set of primers; and

assembling fragments in a pool by PCR amplification.

136. A method of claim 135, wherein:

each set of primers is a pair of terminal primers or a pair of intervening primers;

each primer in a terminal pair of primers corresponds to at least one terminal end of
one parent polymer; and

each primer in each intervening pair of primers corresponds to a specified cut point.

137. A method of claim 136, wherein PCR amplification is performed using a first primer
selected from a first pair of primers, and a second primer selected from a second pair of primers.

138. A method of claim 137, wherein the first and second primers flank the ends of a
polymer fragment.






139. A method of claim i'3U fiirther comprising the step of screening one or more 
recombinant polymers for a property. 

1 40. A method of claim 44, wherein the parental polymers are genes, and the polymers are 
physically recombined by a faraily .shuffling method comprising the steps of: 

specirying one or more selected erossover locations; 

providing sets of primer pairs, wherein each ptiiner of eachpair comprises sequences 
from ivto pfuerit polyajers; which span and include 3 .specified crossover location; 
pioduciru.' fraguicTtls ofilK- par^sv. pob -.iier.?: 

rehsscinhliuji the nagnicnfs in xlu presence cif tlie primers using PCR amplification, 

141. A method of claim 74, wherein the parental polymers are genes, and the polymers are 
physically recombined by a family shuftling method comprising the steps of: 

specifying one or more selected crossover locations; 

providing sets of primer pairs, wherein each primer of each pair comprises sequences 
from two patent polymers which span and include a specified crossover location; 
producing fragments of the parent polymers; 

reassembling the ftagments in the presence of the primei-s rising PCR amplification. 

142. A method of producing recombinant oligonucleotides from two or more paxen? 
oligonucleotides by a staggered extension process comprisirig the steps of: 

selecting one or more crossover locations for each parent oligoiiucieotide: 

cutting each of two or more parents within one or more crossover regions that eacl 
encompass one or more specified crossover locations to define a set of fragments; 

producing a set of defined fragments, wherein each fragment lias an end prime 
comprising a sequence with residues that extend pa.st a specified crossover location; and 

assembling at least one of pair of fragments haviiig sequences which overlap an en< 
primer of at least one fragment of the pair, to produce a recombinant oligonucleotide. 






143. A method of producing recombinant oligonucleotides from two or more parent
oligonucleotides by an in vitro-in vivo recombination method comprising the steps of:

selecting one or more crossover locations for each parent oligonucleotide;

shuffling at least two parent oligonucleotides to produce a set of fragments having
selected crossover locations;

assembling fragments at crossover locations by overlap extension and gap repair, to
provide double stranded sequences containing mismatched regions; and

repairing the mismatched regions in vivo by inserting the double-stranded sequences
into a host cell to provide a library of crossover recombinants.

144. A method of producing recombinant oligonucleotides from two or more parent
oligonucleotides by an in vitro-in vivo recombination method comprising the steps of:

specifying one or more selected cut points for each parent oligonucleotide;

preparing synthetic polymer fragments having sequences corresponding to the
sequences of parent oligonucleotides that are cut at specified cut points;

extending the sequence of each fragment at a cut point against a parental template to
produce a set of oligonucleotide duplexes representing different combinations of fragments;

removing parent homoduplex oligonucleotides; and

providing a set of recombinants from the resulting heteroduplex oligonucleotides.

145. A method of claim 144, wherein the oligonucleotide duplexes are removed by inserting
oligonucleotide duplexes into a host cell in the form of a heteroduplex plasmid.

146. A method of producing recombinant oligonucleotides from two or more parent
oligonucleotides by a PCR amplification method comprising the steps of:

specifying one or more selected cut points for each parent oligonucleotide;

defining oligonucleotide fragments having sequences corresponding to the sequences
of parent oligonucleotides that are cut at specified cut points;






providing sets of primers, wherein each primer in a set hybridizes to all parent strands
at a crossover region corresponding to a specified cut point;

producing a set of defined fragments from each parent by PCR amplification with
each set of primers; and

assembling fragments in a pool by PCR amplification.

147. A method of claim 146, wherein:

each set of primers is a pair of terminal primers or a pair of intervening primers;

each primer in a terminal pair of primers corresponds to at least one terminal end of
one parent polymer; and

each primer in each intervening pair of primers corresponds to a specified cut point.

148. A method of claim 147, wherein PCR amplification is performed using a first primer
selected from a first pair of primers, and a second primer selected from a second pair of primers.

149. A method of claim 148, wherein first and second primers flank the ends of a fragment.

150. A method of producing recombinant oligonucleotides from two or more parent
oligonucleotides by a family shuffling method comprising the steps of:

specifying one or more selected crossover locations for each parent oligonucleotide;

providing sets of primer pairs, wherein each primer of each pair comprises sequences
from two parents which span and include a specified crossover location;

producing fragments of the parent polymers;

reassembling the fragments in the presence of the primers using PCR amplification.

151. A method of claim 1, wherein a coupling interaction [...]
first polymer sequence is disrupted in a crossover mutant if the identity of a residue is
different in the crossover mutant than in the first polymer sequence, and wherein a coupling
interaction between a pair of residues is scaled by the probabilities that the identity and
sequence position of the coupled residues are the same in both parents.
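
The disruption test in claim 151 can be sketched as a simple count. This is a deliberately reduced illustration under stated assumptions: a coupled pair is treated as disrupted when either of its two residues differs between the first polymer sequence and the crossover mutant, the list of coupled pairs is supplied by the caller, and the probability scaling mentioned in the claim is omitted. The sequences and pairs are hypothetical.

```python
def crossover_disruption(parent: str, mutant: str, coupled_pairs) -> int:
    """Count coupled residue pairs (i, j), 0-based, that are disrupted.

    Assumption for this sketch: a pair is disrupted when the mutant
    differs from the parent at position i or at position j.
    """
    if len(parent) != len(mutant):
        raise ValueError("sequences must have equal length")
    changed = [p != m for p, m in zip(parent, mutant)]
    return sum(1 for i, j in coupled_pairs if changed[i] or changed[j])

# Hypothetical first polymer sequence, crossover mutant and coupled pairs.
first_sequence = "ACDEFG"
crossover_mutant = "ACDKLG"           # differs at positions 3 and 4
pairs = [(0, 5), (1, 3), (2, 4), (0, 1)]
count = crossover_disruption(first_sequence, crossover_mutant, pairs)
print(count)  # 2
```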






Determining Possible Fragments Based On Experimental Restrictions

Flowchart boxes: Select Parents; Obtain Sequences; Align Sequences; Identify cut
points; Cut Point Restrictions; Determine fragments; Fragment Pool; Save in
fragment file and cut point file.

FIG. 1A



Determining the Schema Disruption Profile for a Structure

Flowchart boxes: Obtain Structure File; Identify Con[...] Interactions; Identify
Domain; Calculate the Crossover Disruption of Domain; Reject; Accept; Done;
Mark all domains that are disruptive; Identify Optimal Crossovers; Load fragment
file or cut point file; Identify Restricted Optimal Crossovers.

FIG. 1B



FIG. 2 (panels B and C; horizontal axis: Residue Number)



FIG. 4A (horizontal axis: Residue Number)

FIG. 4B (horizontal axis: Residue Number)

FIG. 4C (horizontal axis: Residue Number)

FIG. 4D (horizontal axis: Residue Number)



(A)

All possible recombinants
prepared by crossover
at positions 1 and 2

(B)

These can be prepared by
assembly of synthetic
fragments containing the
crossover positions

Requires fragments
(plus end primers):

FIG. 7



Extension of synthetic
fragments against a
parent template strand
and gap repair

heteroduplex recombination
(remove parent homoduplexes)

library of recombinants
with crossovers in regions
of non-identity

FIG. 8



(A)

Prepare the fragments by
PCR with primers; perform
reactions with primers 1+2,
3+4 and 5+6,
and do same for other parent(s).

double strand of parent 1

(B)

Reassemble fragments in a pool by PCR with 1+6

FIG. 9



(A)

Prepare crossover primers
designed to have crossovers
at designated positions (2
primers for each position).

parent 1
parent 2

(B)

Fragment parent genes and PCR reassemble in the presence of the
crossover primers to promote recombination at designated positions

FIG. 10







FIG. 11 



Recombinant search algorithm

1. Align parent sequences
with template structure

2. Determine all possible crossover points
according to sequence identity algorithm

3. Calculate coupling matrix

4. Pick start parent at random
and copy to offspring until a
possible cut point is reached

5. Pick random number; if less than p,
copy random new parent until next cut
point is reached.

6. Determine crossover
disruption of offspring gene

Offspring

FIG. 12
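
Steps 4 to 6 of the search procedure in FIG. 12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the figure itself: the parents are taken to be already aligned and of equal length (steps 1 and 2), the coupling matrix of step 3 is replaced by a caller-supplied list of coupled residue pairs, "new parent" is read as a parent different from the current one, and disruption is simplified to the number of coupled pairs whose residues come from different parents. All names and values are hypothetical.

```python
import random

def make_offspring(parents, cut_points, p, rng):
    """Copy from a random start parent; after each segment, switch to a
    different random parent with probability p (steps 4 and 5).
    Returns the offspring and, per position, the index of its source parent."""
    if len(parents) < 2 or len({len(s) for s in parents}) != 1:
        raise ValueError("need at least two aligned parents of equal length")
    bounds = [0, *sorted(cut_points), len(parents[0])]
    current = rng.randrange(len(parents))
    pieces, origin = [], []
    for start, end in zip(bounds, bounds[1:]):
        pieces.append(parents[current][start:end])
        origin.extend([current] * (end - start))
        if rng.random() < p:
            others = [k for k in range(len(parents)) if k != current]
            current = rng.choice(others)
    return "".join(pieces), origin

def disruption(origin, coupled_pairs):
    """Step 6, simplified: count coupled pairs inherited from different parents."""
    return sum(1 for i, j in coupled_pairs if origin[i] != origin[j])

parents = ["AAAAAA", "BBBBBB"]        # hypothetical aligned parents
pairs = [(0, 1), (1, 2), (0, 5)]      # hypothetical coupled residue pairs
rng = random.Random(0)
child, origin = make_offspring(parents, cut_points=[2, 4], p=1.0, rng=rng)
score = disruption(origin, pairs)
```

With p = 1.0 and two parents the offspring alternates parents at every cut point, so only the pair (1, 2), which straddles the first cut, is counted as disrupted.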



FIG. 13 (panels (A) and (C))



DIRECTED EVOLUTION ALGORITHM

Create diverse library of mutants using
random mutagenesis and/or
recombination

Determine the fitness of a sorted
fraction of the mutant library

Pick most fit mutant and continue to
the next generation

Goal is achieved

FIG. 14
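
The loop in FIG. 14 can be simulated with a toy model. The sketch below is an illustration under assumptions: the library is generated by single random point mutations only (no recombination), the screened fraction is a random sample of the library, and the current sequence is retained when no screened mutant is at least as fit, which is a choice made here to keep the toy search from regressing. The fitness function, target and parameter values are hypothetical.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # one-letter amino acid codes

def mutate(seq, rng):
    """Return a copy of seq with one randomly chosen position replaced."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(ALPHABET) + seq[i + 1:]

def directed_evolution(start, fitness, goal, library_size=50, screened=20,
                       max_generations=200, seed=0):
    """Iterate: build a mutant library, screen a fraction, keep the fittest."""
    rng = random.Random(seed)
    best = start
    for generation in range(max_generations):
        if fitness(best) >= goal:              # goal is achieved
            return best, generation
        library = [mutate(best, rng) for _ in range(library_size)]
        sample = rng.sample(library, screened)  # screened fraction
        candidate = max(sample, key=fitness)    # most fit screened mutant
        if fitness(candidate) >= fitness(best):
            best = candidate
    return best, max_generations

# Toy fitness: number of positions matching a hypothetical target sequence.
target = "MKV"
fitness = lambda s: sum(a == b for a, b in zip(s, target))
best, generations = directed_evolution("AAA", fitness, goal=len(target))
```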



FIG. 15 (horizontal axis: Ec (Energy), 0 to 40)




FIG. 16 



(B)

Experimental Data:

              wt        wt-insert   1         2
Tm (°C)       52        55.2        n.d.      54.3
Tm (°C)       49.5      53.3        44.5      52.5
t1/2          12.1      2586                  87.5
t1/2          53        138         4         308

Calculations:

              All schema            Fragments             Z-score
              av        stdev       1         2           1         2
Ec            19.260    4.090       10.770    8.124       -2.076    -2.723
Ec[?]         0.006     0.002       0.014     0.005       4.838     -0.857

FIG. 17
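
The Z-score entries listed with FIG. 17 appear consistent with the usual standardization (value minus average, divided by standard deviation) applied to the Ec values of fragments 1 and 2 against the "All schema" average and standard deviation. That reading is an assumption inferred from the numbers, not a formula stated in the figure; the short check below reproduces the two Ec Z-scores under it.

```python
def z_score(value: float, mean: float, stdev: float) -> float:
    """Standard score of a value relative to a reference distribution."""
    return (value - mean) / stdev

# Ec values read from the figure: all-schema average 19.260, stdev 4.090.
z1 = z_score(10.770, 19.260, 4.090)
z2 = z_score(8.124, 19.260, 4.090)
print(round(z1, 3), round(z2, 3))  # -2.076 -2.723
```

The second row of the figure cannot be reproduced to the same precision from its printed entries, which are rounded to three decimals.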







FIG. 18 



The contact map shows residues that are distant (black) and
residues that are close (white). If a given segment
folds an above average number of residues into a given sphere
size, then it is compact.

FIG. 19 (axis label: Number)




FIG. 20 



FIG. 22 (horizontal axis: Residue #)




X is rejected because the fragment size is less than 15

O is accepted because the fragment is > 15 residues and fits into the sphere

O is accepted because the fragment is > 15 residues and fits into the sphere

X is rejected because the fragment does not fit into the sphere

(1) Pick a sphere size (21 angstroms, like Go-Gilbert) and a disruption
threshold; (2) Scan protein using segments at least the average number of
residues for that sphere size or greater (e.g., >15 for 21 angstrom sphere);
(3) Check the disruption of all the compact fragments identified in step 2. If
the fragment has a disruption above a threshold value, keep it; otherwise,
throw it out; (4) If the compact unit is disruptive, increment the schema
disruption measure for all of the residues in the fragment by one. This
indicates that crossovers within the fragment are disfavored.

FIG. 23
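
The four steps listed with FIG. 23 can be sketched as a scan over contiguous segments. Several simplifications are assumed here and are not taken from the figure: residues are represented by one 3D coordinate each, a segment is treated as fitting the sphere when no two of its residues are farther apart than the sphere size (a pairwise-distance proxy, not an exact enclosing-sphere test), and the disruption of a fragment is supplied by a caller-provided function.

```python
import math
from itertools import combinations

def fits_in_sphere(coords, diameter):
    """Assumed compactness test: no two residues farther apart than `diameter`."""
    return all(math.dist(a, b) <= diameter for a, b in combinations(coords, 2))

def schema_disruption_profile(coords, fragment_disruption, threshold,
                              diameter=21.0, min_len=16):
    """Per-residue count of disruptive compact fragments covering that residue.

    fragment_disruption(start, end) is a hypothetical callback returning the
    disruption of the segment coords[start:end].
    """
    n = len(coords)
    profile = [0] * n
    for start in range(n):
        for end in range(start + min_len, n + 1):
            if not fits_in_sphere(coords[start:end], diameter):
                break  # longer segments from this start contain the same distant pair
            if fragment_disruption(start, end) > threshold:
                for i in range(start, end):
                    profile[i] += 1
    return profile

# Toy chain: 20 residues on a line, 1 angstrom apart; every compact fragment
# is treated as disruptive so the profile simply counts compact segments.
coords = [(float(i), 0.0, 0.0) for i in range(20)]
profile = schema_disruption_profile(coords, lambda s, e: 1.0,
                                    threshold=0.5, diameter=16.0)
```

Residues with high counts lie inside many disruptive compact fragments, which under the figure's reading marks them as disfavored crossover positions.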



FIG. 24 (two panels; horizontal axis: Residue #)




FIG. 25 




FIG. 26 




FIG. 27B (labels: Total; parent 1 2 3 4; Low Disruption)



